8 Statistics

Contents:
A Data and sampling
B Analysis and representation
C Stemplots
D Measures of centre
E The variability (spread) of a distribution
F Box and whisker plots
G Extended investigations
H Normal distributions
I Correlation
J Linear regression

INTRODUCTION
Decisions made by governments, businesses, education departments, sporting bodies, etc., are often made after careful consideration of statistical evidence. Statistics play a vital role in many areas of our society.
Statistics are a tool for helping to make rational decisions about variables described by data sets.
Amongst other things, governments use statistics to help formulate future policies. Businesses often use statistics to aid decision making, for example, whether or not to enter the market with an alternative to a product when there are already several of these products on the market.
Statistical information about sport has increased dramatically in recent years. We only need to watch a ‘Twenty20’ cricket match to observe the many statistical graphs and tables used to help make the viewer more informed.
In advertising, ‘product superiority claims’ are frequent. Often statistical analysis can be used to analyse such claims so that we may question their validity.
Following are some examples of the types of problems we may face, and where statistical methods may help answer them:
• A young executive of a hotel chain claimed that lowering the room tariff by 10% would increase the patronage by 25%. Would this be true?
• A manufacturer of AAA batteries claimed that her batteries outlasted all other leading brands by at least 100 hours. Is she correct?
• In the AFL, the umpires give more free kicks to the home team than to the other team due to the crowd’s influence. What evidence do we have, and is the claim justified?
• Should lights be placed at a particular intersection of two roads? What factors should determine this?
• An employer claims that younger employees (< 30 years) have on average twice as many sick days as the older ones (> 30 years). Is he correct?
• Which drug for helping to quit smoking has the greatest chance of success?
• Does the unemployment rate of a city affect its crime rate?

DISCUSSION
Examine the following problems:
Problem 1: How much will it cost each week to rent a one-bedroom flat in the Eastern suburbs of Adelaide compared with one in the Western suburbs?
Problem 2: Has the size of harvested crayfish changed from 1998 to 2008?
Problem 3: Do two different science textbooks have the same reading level, determined by word length?
For each problem, discuss:
• how you could obtain appropriate data
• what random variable you need to consider
• how you would make sure the data is randomly selected.

OPENING PROBLEM 1
Kareline is looking to buy a house in the Adelaide suburb of Prospect. She has collected the information presented in the table, part of which is shown below. Click on the icon to expose all the data.
For you to consider:
• What is the variable being considered?
• What is the price range of the houses?
• What is the price range of the middle 50% of the houses?
• What is the ‘average’ house price?
• Is it possible for this data to have two ‘averages’?
What would be the effect on the interpretation of data if:
• the extreme values were removed (for example, if Kareline was not prepared to spend more than $275 000)
• one or more data values were incorrect
• additions were made to the set of data
• Kareline was only interested in 3-bedroom houses?
How reliable is Kareline’s data? How can that reliability be tested?
Statistical measures provide powerful tools for answering questions. Kareline may have wondered, ‘What is the mean price of a house in Prospect?’. Such a question provides a starting point for collecting and interpreting data.

A DATA AND SAMPLING
When information for a statistical investigation is collected and recorded, the information is referred to as data.

WHAT IS A STATISTICAL INVESTIGATION?
The process that Kareline used to collect and interpret data for her house hunting exercise is an example of a statistical investigation.
There are five processes involved in a statistical investigation:
Step 1: Stating the problem
For Kareline, the problem examined is to find a reliable ‘average’ cost of a house in Prospect.
Step 2: Collection of data (information)
Data for a statistical investigation can be collected from records, from surveys (either face-to-face, telephone, or postal), by direct observation or by measuring or counting. Unless the correct data is collected, valid conclusions cannot be made.
Kareline has collected the data from real estate advertisements in newspapers and on the internet.
Step 3: Organisation and display of data
Data can be organised into tables and displayed on a graph. This allows us to identify features of the data more easily.
Kareline has tabulated her data using a spreadsheet.
Step 4: Calculation of descriptive statistics
Some statistics used to describe a set of data are the centre and the spread of the data. These give us a picture of the sample or population under investigation.
Kareline may calculate the mean (average) house price and the range of house prices. She may also look for outliers in the data and decide if the outliers should be included in her investigation.
Step 5: Interpretation of statistics
This process involves explaining the meaning of the table, graph or descriptive statistics in terms of the variable, or theory, being investigated.
Kareline may explain any graphs generated and interpret the statistics calculated from the data.

COLLECTION OF DATA
The variable is the subject that we are investigating.
The entire group of objects from which information is required is called the population.
Gathering statistical information properly is vitally important. If the data is gathered incorrectly, any resulting analysis would almost certainly lead to incorrect conclusions about the population.
The gathering of statistical data may take the form of:
• a census, where information is collected from the whole population, or
• a survey, where information is collected from a much smaller group of the population, called a sample.
For example:
• The Australian Bureau of Statistics conducts a census of the whole population of Australia every five years.
• In opinion polls before an election, a survey is conducted to see which way a sample of the population will vote.
• The students in a school are to vote for a new school captain. If 20 students from the school are asked how they will vote, then the population is all the students who attend the school, and the 20 students are a sample.
Note: A population generally consists of a large number of items. Because of the expense and time factors it is often easier to select a sample, rather than use the whole population, and hope that the sample is truly representative of the population.
For accurate information when sampling, it is essential that:
• the number of individuals in the sample is large enough
• the individuals involved in the survey are randomly chosen from the population. This means that every member of the population has an equal chance of being chosen.
If the individuals are not randomly chosen or the sample is too small, the data collected may be biased towards a particular outcome.
For example: If the purpose of a survey is to investigate how the population of Adelaide will vote at the next election, then surveying the residents of only one suburb would not provide information that represents all of Adelaide.

TYPES OF DATA
Data are individual observations of a variable.
A variable is a quantity that can have a value recorded for it or to which we can assign an attribute or quality.
Two types of variable that we commonly deal with are categorical variables and numerical variables.

CATEGORICAL VARIABLES
A quality or category is recorded for this type of variable. The information collected is called categorical data.
Examples of categorical variables and their possible categories include:
Colour of eyes: blue, brown, hazel, green and violet
Continent of birth: Europe, Asia, North America, South America, Africa, Australia and Antarctica
Gender: male or female
Type of car: General Motors, Toyota, Ford, Mazda, BMW, Subaru, etc.
We will not consider categorical data in this course.

NUMERICAL VARIABLES
A number is recorded for this type of variable. The information collected is called numerical data.
There are two types of numerical variables:

Discrete numerical variables
A discrete variable can only take distinct values and these values are often obtained by counting.
Examples of discrete numerical variables and their possible values include:
The number of children in a family: 0, 1, 2, 3, ...
The score on a test, out of 30 marks: 0, 1, 2, ..., 29, 30.

Continuous numerical variables
A continuous numerical variable can theoretically take any value on a part of the number line. Its value often has to be measured.
Examples of continuous numerical variables and their possible values include:
The height of Year 12 students: any value from about 140 cm to 220 cm
The speed of cars on a stretch of highway: any value from 0 km/h to the fastest speed that a car can travel, but most likely in the range 30 km/h to 120 km/h
The weight of newborn babies: any value from 0 kg to 10 kg but most likely in the range 0.5 kg to 5 kg
The time taken to run 100 m: any value from 9 seconds to 30 seconds.

EXERCISE 8A.1
1 40 students, from a school with 820 students, are randomly selected to complete a survey on their school uniform. In this situation:
a what is the population size
b what is the size of the sample?
2 A television station is conducting a poll in which viewers telephone the station to answer the question ‘Should Australia become a republic?’
a What is the population being surveyed in this situation?
b How is the data biased if it is used to represent the views of all Australians?
3 A new drug called Cobrasyl is approved for the treatment of high blood pressure in humans. The drug, a derivative of cobra venom, is able to reduce blood pressure to an acceptable level.
Before its release, a research team treated 127 high blood pressure patients with the drug and in 119 cases it reduced their blood pressure to an acceptable level.
a What is the sample of interest?
b What is the population of interest?
4 A polling agency is employed to survey the voting intention of residents of a particular electorate in the next election. From the data collected they are to predict the election result in that electorate. Explain why each of the following situations would produce a biased sample.
a A random selection of people in the local large shopping complex is surveyed between 1 pm and 3 pm on a weekday.
b All the members of the local golf club are surveyed.
c A random sample of people at the local train station between 7 am and 9 am is surveyed.
d A doorknock is undertaken, surveying every voter in a particular street.
5 Classify the following numerical variables as continuous or discrete.
a The quantity of fat in a lamb chop.
b The mark out of 50 for a Geography test.
c The weight of a seventeen year old student.
d The volume of water in a cup of coffee.
e The number of trout in a lake.
f The number of hairs on a cat.
g The length of hairs on a horse.
h The height of a sky-scraper.
i The number of floors sky-scrapers have.
j The time taken for students to get from home to school.
6 A sample of public trees in a municipality was surveyed for the following data:
a the diameter of the tree (in centimetres) measured 1 metre above the ground
b the type of tree
c the location of the tree (nature strip, park, reserve, roundabout)
d the height of the tree, in metres
e the time (in months) since the last inspection
f the number of inspections since planting
g the condition of the tree (very good, good, fair, unsatisfactory).
Classify the data collected as categorical, discrete numerical or continuous numerical.
7 For each of the following:
i identify the random variable being considered
ii give possible values for the random variable
iii indicate whether the variable is continuous or discrete.
a To measure the rainfall over a 24-hour period at Mount Gambier the height of water collected in a rain gauge (up to 200 mm) is used.
b To investigate the stopping distance for a tyre with a new tread pattern a braking experiment is carried out.
c To check the reliability of a new type of light switch, switches are repeatedly turned off and on until they fail.
d The publisher of a golfing magazine prints 20 000 copies and is concerned with the number of copies sold.

RANDOM SAMPLES
When taking a sample it is hoped that the information gathered is representative of the entire population. We must take certain steps to ensure that this is so.
If the sample we choose is too small, the data obtained is likely to be less reliable than that obtained from larger samples.
For accurate information when sampling, it is essential that:
• the individuals involved in the survey are randomly chosen from the population
• the number of individuals in the sample is large enough.
For example: Measuring the heights of a group of three fifteen-year-olds would not give a very reliable estimate of the height of fifteen-year-olds all over the world.
We therefore need to choose a random sample that is large enough to represent the population. Note that conclusions based on a sample will never be as accurate as conclusions made from the whole population, but if we choose our sample carefully, they will be a good representation.
Care should be taken not to make a sample too large as this is costly, time consuming and often unnecessary. A balance needs to be struck so that the sample is large enough for there to be confidence in the results but not so large that it is too costly and time consuming to collect and analyse the data.
As we have said before, the sample selected from the population must exhibit the characteristics of the chosen population so that the sample is truly representative of the population. Unless a sample properly represents the population, it would be foolish to draw conclusions about the population based on the sample results.
For example, a survey on voters’ preferences prior to an election should include all socioeconomic classes and both male and female voters, otherwise the survey may produce biased results which could not be relied upon.

THE SIZE OF A SAMPLE
The size of a sample should be chosen to reliably reflect the information we want to find out about the entire population.
Various methods exist to find the appropriate sample size. Some businesses may choose fewer than the desired number in a sample because of the expense incurred. For example, a medical research team in the UK always chooses a sample of size 80 for this reason.
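The balance between sample size and reliability described above can be explored with a short simulation. The sketch below is only an illustration under stated assumptions: the population is a made-up list of 10 000 heights (not data from this chapter), and the function name `spread_of_sample_means` is hypothetical. For each sample size it draws many random samples and reports how much the sample means vary.

```python
import random
import statistics


def spread_of_sample_means(population, sample_size, repeats=1000, seed=1):
    """Standard deviation of the means of `repeats` random samples."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.pstdev(means)


# Assumption: an invented population of heights in cm, used only for illustration.
population_rng = random.Random(0)
population = [population_rng.gauss(170, 10) for _ in range(10_000)]

for n in (2, 10, 100):
    spread = spread_of_sample_means(population, n)
    print(f"sample size {n:>3}: spread of sample means = {spread:.2f} cm")
```

The spread shrinks as the sample size grows, but each extra reduction requires a much larger sample, which is why a very large sample is often not worth its cost.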
Although there is no mathematical reason for doing so, some people may choose a sample of size $\sqrt{n}$ when $n$ is the population size. Others might choose $\sqrt{n}$ + 10% of $n$.
Often both $\sqrt{n}$ and $\sqrt{n}$ + 10% of $n$ give sample sizes which are too small. Another complication is that the population size $n$ is often unknown.

EXERCISE 8A.2
1 Discuss how you would randomly select:
a first and second prize in a hockey club raffle
b 12 members of the public to stand for jury duty
c four numbers from 0 to 37 on a roulette wheel.
2 In order to estimate the mean of a population, samples of various sizes were taken and in each case the sample mean was found. Alongside is a graph of the results obtained.
[Graph: mean of sample (vertical axis) plotted against sample size (horizontal axis, 500 to 2000)]
a From the graph, what sample size would be considered to be large enough?
b What is the best estimate of the population mean?
3
Sample size (n):   200   500   1000  1600  2500  3500  5000
% in favour (P):   82.0  56.4  69.7  62.9  61.8  62.0  61.7
The table above shows the results of asking the question: “Are you in favour of Australia becoming a republic?”
a Plot the graph of P against n, with n on the horizontal axis.
b At what sample size do the results become reasonably consistent?
c What information can we see from this data?
4 Discuss: “In conducting a survey to find out the percentage of people who believe the AFL grand final should always be played at the MCG (Melbourne), it would be a good idea to ask a section of the crowd at this year’s clash between the West Coast Eagles and the Adelaide Crows.”
5 An alpine lake contains trout. On one particular day Rex the research scientist caught 600 trout. They were then tagged and released back into the lake.
A fortnight later 350 trout were caught and of these 28 had tags.
a Estimate the number of trout in the alpine lake.
b In calculating your estimate, what assumptions have you made?
6 When examining the daily production of bottles of softdrink for quality control purposes, industrial chemist Tomas takes a sample of $\sqrt{n}$ bottles ($n$ is the daily production level).
a What sample size would Tomas choose if the daily production was 27 583 bottles?
b Tomas would choose at random about 1 bottle in every x. Find x.
c One day he calculated the sample size to be 143. What was the approximate production level to the nearest 100?
d Tomas has just decided that the sample size is too small and will use $\sqrt{n}$ + 10% of $n$ bottles in future samples. What sample size would he choose if the daily production was 24 978 bottles?
e Suggest why the management may be unhappy with Tomas’s decision in d.
7 Most often the population size is unknown. The following formulae are mathematically correct for determining sample size for simple random sampling:
• For an extremely large population where the population size is unknown:
To be very confident that a sample accurately reflects the population within ±r%, we take a sample of size $n$ where $n = \dfrac{9600}{r^2}$
• For a population size known to be $N$:
To be very confident that a sample accurately reflects the population within ±r%, we take a sample of size $n$ where $n = \dfrac{9600N}{9600 + N r^2}$
a To examine the proportion of successes of a new weight reduction drug, a sample of users needs to be taken. How large a sample must be taken to be very confident that the sample accurately reflects the population of users within ±3% if the population size is unknown?
b An executive of Mitbushsui Motors wants to find out how much genuine interest there is in their new series Manga, amongst the Australian community. In order to get a reasonably accurate estimate (within 2%), but at a reasonable cost:
i what size sample should they include in their survey
ii how should they decide who should be in their survey
iii what questions should be asked in the survey?
c A reporter for the Port Adelaide Messenger was seeking answers to the question: ‘Who do you intend to vote for at the next Federal election?’. How large a sample would he need if there are 47 621 voters on the electoral roll and he wishes to be very confident of accuracy within ±2.5%?
d A local council sends a form to households of a suburb of 3578 houses, asking their opinion of a new development in the area. If they expect 60% of recipients to respond, how many forms should be sent out to be very sure the results are accurate within 3%?
e To determine whether members of a local gym would be willing to pay higher fees in order to fund the installation of a new swimming pool, a sample of the members is surveyed. Given that there are 568 members at the gym, how large a sample must be taken to be very confident that the sample accurately measures the views of all the members within 3%?
f A researcher wishes to find out the proportion of high school students in Adelaide who have part time jobs. She does not know the number of high school students in Adelaide, and wants to be very confident that the sample she surveys accurately reflects the population within 3.5%. If she surveys no more than 50 students from any given high school to minimise bias, what is the least number of schools she must visit?

SAMPLING METHODS
Possible methods are:
A.
SIMPLE RANDOM SAMPLING
For a sample to have the best chance of being truly representative of the population it should be chosen at random. That is, all members of the population have an equal chance of being chosen in the sample. This is a simple random sample.
Random samples can be chosen using coins, dice, numbered tokens, random number tables, or random number generators on computers or calculators.
In order to randomly select a sample, each member of the population is assigned a number. If a member’s number appears, that member is part of the sample.
For example: Suppose you wish to choose X-lotto numbers. The population of numbers is the integers 1 to 45 inclusive and you are going to choose a ‘sample’ of six different numbers. How could you choose these numbers randomly?
Method 1: Number forty five pieces of paper, place them in a container and select six pieces of paper without looking.
Method 2: Use the random number generator on the calculator.
Using a Texas Instruments TI-83:
Press MATH, move to the PRB menu, and press 5 to select 5:randInt( from the MATH PRB menu. This will bring randInt( to the screen. Now press 1 , 45 ) . Pressing ENTER repeatedly will give random integers between 1 and 45. Ignore repetitions.
Using a Casio fx-9860G:
From the RUN menu, press OPTN F6 (▷) F4 (NUM) F2 (INT). Then press ( 45 EXIT F3 (PROB) F4 (Ran#) ) . Then press + 1 EXE , so that the screen shows Int (45Ran#) + 1. Now repeatedly press EXE to produce more random integers. Ignore repetitions.
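Method 1 and Method 2 can also be mimicked on a computer. The following is a minimal sketch using Python's standard `random` module; the function names are hypothetical and chosen only for this illustration. The first function behaves like drawing pieces of paper from a container, and the second behaves like pressing the calculator key repeatedly and ignoring repetitions.

```python
import random


def draw_from_container(count=6, lowest=1, highest=45, rng=None):
    """Method 1: pick `count` different numbers without replacement."""
    rng = rng or random.Random()
    return sorted(rng.sample(range(lowest, highest + 1), count))


def repeated_key_presses(count=6, lowest=1, highest=45, rng=None):
    """Method 2: generate random integers one at a time, ignoring repetitions."""
    rng = rng or random.Random()
    chosen = []
    while len(chosen) < count:
        number = rng.randint(lowest, highest)  # both endpoints are included
        if number not in chosen:               # ignore a repeated number
            chosen.append(number)
    return chosen


print(draw_from_container())
print(repeated_key_presses())
```

Both functions return six different integers from 1 to 45; the actual numbers change on every run unless a seeded `random.Random` object is passed in.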
Example 1
The table shown gives the monthly sales figures, in thousands of dollars, for a shop over a six year period.
a Choose a year at random.
b Choose a month at random.
c Choose three consecutive years.

            2002  2003  2004  2005  2006  2007
January     43.1  48.7  45.7  44.0  48.6  46.3
February    38.2  35.3  36.4  38.3  37.7  40.2
March       38.6  36.0  36.2  34.8  35.3  33.3
April       40.2  40.9  42.4  42.5  43.8  35.7
May         43.2  44.2  47.0  48.7  50.3  52.4
June        27.8  32.3  33.5  34.1  32.2  35.8
July        26.4  27.2  23.5  27.2  27.7  28.1
August      23.8  24.9  24.8  27.6  26.1  28.2
September   27.4  30.8  32.7  33.6  34.9  35.1
October     40.4  39.3  38.7  41.3  42.4  44.9
November    68.3  67.4  67.3  69.8  70.4  72.6
December    81.2  83.9  84.6  85.5  88.3  87.2

a There are six years from which to choose. We could use a die to randomly choose one of these years; the year 2002 would be represented by 1, 2003 by 2, ..., 2007 by 6. Alternatively, we could use the random generator on a calculator. The randomly chosen year is 2006.
b There are twelve months from which we need to choose one month. We use the calculator, with 1 representing January, 2 representing February, etc. The randomly chosen month is November.
c To choose three consecutive years, we need to establish the number of sets of three consecutive years that are possible:
1 2002 - 2004
2 2003 - 2005
3 2004 - 2006
4 2005 - 2007
There are four possibilities, from which we have to choose one. Using the calculator, the randomly chosen period is 3, that is, 2004 to 2006.

To choose a simple random sample:
1 Find the sample size needed.
2 State the number of possibilities from which you can choose, and number them if necessary.
3 State the random number generator that you are using.
4 Explain what you will do if repeated random numbers are not applicable.
5 State the random number(s) chosen and the data that is now in your sample.

EXERCISE 8A.3
1 Use your calculator to:
a select a random sample of six different numbers between 5 and 25 inclusive
b select a random sample of 10 different numbers between 1 and 25 inclusive
c select a random sample of six different numbers between 1 and 45 inclusive
d select a random sample of 5 different numbers between 100 and 499 inclusive.
2 Click on the icon to obtain a printable calendar for 2008 showing the weeks of the year. Each of the days is numbered.
Using a random number generator, choose a sample from the calendar of:
a five different dates
b a complete week starting with a Monday
c a month
d three different months
e three consecutive months
f a four week period starting on a Saturday
g a four week period starting on any day.
Explain your method of selection in each case.
[Calendar extract: each date of 2008 is listed with its day of the week, its week number, and its day number in brackets. For example, Tuesday 1 January is day (1) in Wk 1, Friday 1 February is day (32), Saturday 1 March is day (61), Tuesday 1 April is day (92), and Thursday 1 May is day (122).]

B.
SYSTEMATIC SAMPLING
Example 2
Management of a large city store wishes to find out how potential customers like the look of a new product and whether they would buy it. They decide on a 5% systematic sampling procedure. Explain what this means.
We notice that: $5\% = \frac{5}{100} = \frac{1}{20}$
So, 1 in 20 people passing by is asked to participate.
If we start with, say, the 3rd person who passes by, then we need to ask the 23rd, 43rd, 63rd, 83rd, 103rd, ..... and so on for a period until sufficient data is obtained.

To obtain a k% systematic sample, we need to choose a starting place and then choose every $\left(\frac{100}{k}\right)$th one after that.
If an accountancy firm wishes to randomly survey its 3217 clients using systematic sampling, they may do it at a 10% level. Since their clients each have files, they might select the 3rd, 13th, 23rd, 33rd, etc.

C. STRATIFIED SAMPLING
Suppose you wish to know the opinions of the whole student body on possible changes to the school uniform. Simple random sampling may not be appropriate, as due to chance a disproportionate number of, say, year 11s may be chosen and their views may not be considered to represent the views of all students.
What we do is randomly sample each year level with a sample size proportional to the number in that year level.

Example 3
In our school there are 137 year 8s, 152 year 9s, 174 year 10s, 168 year 11s and 121 year 12s. A stratified sample of 50 students is needed. How many should be randomly selected from each group?
Total number of students in the school is: 137 + 152 + 174 + 168 + 121 = 752
∴ number of year 8s = $\frac{137}{752} \times 50 \approx 9$
number of year 9s = $\frac{152}{752} \times 50 \approx 10$
number of year 10s = $\frac{174}{752} \times 50 \approx 12$
number of year 11s = $\frac{168}{752} \times 50 \approx 11$
number of year 12s = $\frac{121}{752} \times 50 \approx 8$
We then have to randomly select 9 year 8s, 10 year 9s, etc., in the same way.
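The arithmetic in Example 2 and Example 3 can be checked with a few lines of code. This is only a sketch: the function names `systematic_positions` and `stratified_allocation` are hypothetical, and the limit of 110 passers-by is an assumption made so that the list stops somewhere. Note also that rounding each stratum separately does not always add up to the required total; it happens to do so for the school in Example 3.

```python
def systematic_positions(start, percent, population_size):
    """Positions selected by a percent% systematic sample beginning at `start`."""
    step = round(100 / percent)          # 5% means every 20th member
    return list(range(start, population_size + 1, step))


def stratified_allocation(strata_sizes, sample_size):
    """Number to select from each stratum, proportional to the stratum size."""
    total = sum(strata_sizes.values())
    return {
        name: round(size / total * sample_size)
        for name, size in strata_sizes.items()
    }


# Example 2: a 5% systematic sample starting with the 3rd person
# (the cut-off of 110 people is assumed for illustration).
print(systematic_positions(start=3, percent=5, population_size=110))

# Example 3: a stratified sample of 50 students from the five year levels.
school = {"year 8": 137, "year 9": 152, "year 10": 174, "year 11": 168, "year 12": 121}
allocation = stratified_allocation(school, sample_size=50)
print(allocation, sum(allocation.values()))
```

The first call lists the 3rd, 23rd, 43rd, 63rd, 83rd and 103rd positions, and the second reproduces the allocation 9, 10, 12, 11 and 8 found in Example 3.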
To obtain a stratified random sample the population is divided into subgroups called strata and random samples are proportionally selected from each subgroup.
[Diagram: the strata (Year 8s, Year 9s, Year 10s, Year 11s, Year 12s) and the random sample drawn from each]
Note: Other sampling techniques can be used, for example, cluster sampling. We do not consider them in this course.

EXERCISE 8A.4
1 An NBL basketball club averages 3540 spectators per game. The catering manager wants to conduct a survey to investigate the proportion of spectators who would spend more than $20 on food and drinks at the game. He decides to survey the first 40 people through the gate.
a Discuss any potential bias in the method chosen.
b How reliable would the sample be for estimating the proportion of the population who would spend this amount? Discuss the sample size in your answer.
c Suggest a better sampling method, including a suitable sample size, that would better represent the population.
2 A golf club has 1800 members with ages in the following ranges:

Age range    No. of members
under 18     257
18 - < 40    421
40 - < 55    632
55 - < 70    356
over 70      134

A member survey is to be undertaken to determine the proportion who want changes to dress regulations.
a Why wouldn’t the golf club survey all members on the proposed changes to dress regulations?
b What minimum sample size should the golf club consider to be 95% confident of accuracy within 5%?
c If a stratified sample size of 350 is to be used, how many of each age group above should be surveyed?
3 A large retail store has the following staff: departmental managers - 10; supervisors - 24; senior sales staff - 62; junior sales staff - 98; shelf packers - 28. Management wishes to interview a sample of 30 staff to obtain an overall picture of the staff view of operating procedures.
How many of each group of staff members would be selected for the sample to be representative of overall staff opinion?
4 A school has the following enrolments:

Year group   Boys   Girls
8            82     51
9            73     75
10           52     94
11           78     46
12           84     98

A financial planner wishes to survey the students to investigate the number of students who receive more than $10 pocket money each week. She decides on a sample size of 30.
a Is a sample size of 30 likely to provide a reliable estimate of the proportion of the population who receive more than $10 per week? Explain.
b If the survey is to be done using a stratified sampling procedure, calculate the number to be included in the survey of:
i boys
ii girls
iii year 8 girls
iv year 11 boys
v year 12s
c Suggest a way of increasing the reliability of the sample results.
5 The 200 students in year 11 and 12 of a high school were asked whether they had ever smoked a cigarette (y) or not (n). The replies, as they were received, were:
nnnny nnnyn ynnnn yynyy ynyny ynnyn nyynn yynyn ynnyn nnyyy
yyyyy nnnyy nnnnn nnyny yynny nynnn ynyyn nnyny ynnnn yyyyn
yynnn nynyn nynnn yynny nyynn yynyn ynynn ynnyy nyyny ynynn
nyynn nnnyy ynyyn yyyny ynnyy nnyny yynyy ynyyy nyyyn ynnnn
a Why is this data considered in this case to be a population?
b Find the actual proportion of all students who said they had smoked.
c Examine the validity and usefulness of the following sampling techniques which could have been used to estimate the proportion in b without actually counting all the replies:
i sampling the first five replies
ii sampling the first ten replies
iii sampling every second reply
iv sampling the fourth member of every group of five
v randomly selecting 30 numbers from 001 to 200 and choosing the response corresponding to that number.
(Note: The 96th response is coloured.)
d Are any of (simple random sample, systematic sample, stratified sample) used in c i to v?

6 Imagine you are an agricultural researcher with a trial plot of fodder grass on which you are testing a new fertiliser. The plot is 10 metres square. After the grass has been growing for one month, you need to harvest a sample to weigh. It is too time-consuming to collect every blade of grass so you need to collect a sample representative of the whole plot.
a Describe and explain how you would divide up the plot to select a sample of grass to collect and weigh. You could use a set of random numbers in some way.
b Explain why you think it is necessary to select a random sample across the trial plot and not just the corner.

SAMPLING ERRORS

Sampling errors are not errors if they are intentional. We will briefly consider how unintentional sampling errors can occur.

Errors in sampling could arise from:
• bias caused by faults in the sampling process, sometimes called systematic errors. For example, in sampling flat rent figures in a suburb one must not consider only the large advertisements, as these may more frequently be for classier, more expensive accommodation. This sort of bias is often unintentional. Remember that the sample must truly represent the population.
• statistical (or random) errors, which are caused by natural variability. A sample may not reflect the population due to these errors. However, in much larger samples these errors tend to be smaller.

SAMPLE SIZE WHEN ESTIMATING A POPULATION MEAN

INVESTIGATION 1   HOW LARGE MUST A SAMPLE BE?

Click on the icon to view a population of known mean.

What to do:
1 Select a sample of size n = 2 and find its mean \(\bar{x}\).
2 Repeat several times.
Comment on how \(\bar{x}\) compares with the population mean.
3 Now select samples of size n = 10 and in each case find \(\bar{x}\). Comment on how these values of \(\bar{x}\) compare with the true population mean.
4 Repeat for samples of size n = 100.
5 Write a brief report on your findings.

From the investigation you should have observed that:

The larger the sample size, the more closely the mean of the sample reflects the mean of the population.

This is true for other population characteristics, for example, the standard deviation. We examine the mean and standard deviation in greater detail later.

It is true to say that: "The greater the sample size, the more reliable will be our findings".

However, we must strike a balance between the confidence in the reliability of our results and the expense of carrying out a large sampling procedure.

B ANALYSIS AND REPRESENTATION

Once data has been collected and organised (in table form) it is ready to be analysed and represented in graphical form.

DISCRETE NUMERICAL DATA

Recall that a discrete numerical variable can take only distinct values. The data is often obtained by counting.

For example, a farmer has a crop of peas and wishes to investigate the number of peas in the pods. He takes a random sample of 50 pods and counts the number of peas in each pod, obtaining the following data:

6 6 5 4 9 8 7 7 7 6 5 6 7 8 8 8 7 5 2 4 7 7 6 7 8
8 7 8 6 6 4 2 9 1 3 3 5 9 8 8 7 7 6 7 7 6 8 4 5 5

The variable in this situation is the discrete numerical variable 'the number of peas in a pod'. The data could only take the discrete numerical values 0, 1, 2, 3, 4, ....

TABLES AND GRAPHS

To organise his data the farmer could use the tally and frequency table shown. A barchart could be used to display the results.
No. of peas in pod   Tally               Frequency
1                    |                   1
2                    ||                  2
3                    ||                  2
4                    ||||                4
5                    ||||| |             6
6                    ||||| ||||          9
7                    ||||| ||||| |||     13
8                    ||||| |||||         10
9                    |||                 3
Total                                    50

[Barchart: frequency (0 to 14) against number of peas in pod (0 to 9)]

Alternatively, the farmer could use a dot plot, which is a convenient method of tallying the data and at the same time displaying the frequencies.

To draw a dot plot:
1 Draw a horizontal axis and mark it with the values that the variable can take. For this example, the variable took values from 1 to 9, so we mark the axis from 0 to 10.
2 Label the axis with a description, in this case: number of peas in pod.
3 Systematically go through the data, placing a dot or cross above the appropriate position on the axis.

The dot plot for this example is:

[Dot plot: number of peas in pod, axis marked from 0 to 10]

Notice that the dots are evenly spaced so the final plot looks similar to the barchart.

From both the barchart and the dot plot it can be seen that:
• Seven was the most frequently occurring number of peas in a pod.
• \(\frac{35}{50} \times 100 = 70\%\) of the pods yielded six or more peas.
• 10% of the pods had fewer than 4 peas in them.

DESCRIBING THE DISTRIBUTION OF A SET OF DATA

The distribution of a set of data is the pattern or shape of its graph.

For the example above, the graph has the general shape shown alongside:

[Figure: curve stretched to the left]

This distribution of the data is said to be negatively skewed because it is stretched to the left (the negative direction).

A positively skewed distribution of data would have a shape:

[Figure: curve stretched to the right]

A symmetrical distribution of data is neither positively nor negatively skewed, but is symmetrical about a central value.

A set of data whose graph has two peaks is said to be bimodal.
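
The tally table and the three observations drawn from the pea data earlier in this section can be checked with a few lines of code. The following is a minimal sketch in Python and is not part of the original text; the variable names are illustrative, and the data are the 50 pod counts listed for the farmer's sample.

```python
from collections import Counter

# The 50 pod counts from the farmer's sample, as listed in the text.
peas = [int(d) for d in (
    "6654987776567888752477678"
    "8786642913359887767768455"
)]

counts = Counter(peas)   # frequency of each number of peas
n = len(peas)            # sample size

# A text dot plot, rotated so that each value of the variable is a row.
for value in range(0, 11):
    print(f"{value:2d} | {'.' * counts[value]}")

# The observations read from the graphs.
mode = max(counts, key=counts.get)   # most frequently occurring value
pct_six_or_more = 100 * sum(c for v, c in counts.items() if v >= 6) / n
pct_fewer_than_4 = 100 * sum(c for v, c in counts.items() if v < 4) / n

print("most frequent number of peas:", mode)
print("percentage with six or more :", pct_six_or_more)
print("percentage with fewer than 4:", pct_fewer_than_4)
```

Rotating the dot plot keeps the sketch short; the row for 7 is the longest, which corresponds to the tallest column of the barchart.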
Note that the horizontal axis is a number line with numbers in ascending order from left to right.

Outliers are data values that are either much larger or much smaller than the general body of data. Outliers appear separated from the body of data on a frequency graph.

For example, if the farmer found one pod in his sample contained 13 peas then the data value 13 would be considered an outlier. It is much larger than the other data in the sample. On the column graph it appears separated.

[Column graph: frequency (0 to 12) against number of peas in pod (0 to 13), with the value 13 marked as an outlier]

EXERCISE 8B.1

1 A randomly selected sample of households in both Australia and Thailand were asked, "How many people live in your household?" Column graphs have been constructed for the results.

[Column graphs: Size of households (Australia) and Size of households (Thailand); frequency against number of people in the household (1 to 10)]

For each of Australia and Thailand, answer the following:
a How many households were surveyed?
b How many households had only one or two occupants?
c What percentage of the households had five or more occupants?
d Compare the distribution of the data for each survey.

2 A bowler recorded the number of wickets he took in the first 15 innings of the season and the last 15 innings of the season.

1st half of season: 1 1 3 2 0 0 4 2 2 4 3 1 0 1 0
2nd half of season: 2 1 5 1 3 7 2 2 2 4 3 1 1 0 3

a Construct side by side dot plots for each set of data.
b Compare the distributions of the data sets, noting any outliers.
c In which part of the season did the bowler have more success? Give evidence.
3 For an investigation into the number of phone calls made by teenagers, samples of 50 thirteen-year-olds and 50 fifteen-year-olds were asked the question, "How many phone calls did you make yesterday?" The given dot plot was constructed for the data.

[Dot plot: The no. of phone calls made in a day by teenagers; 13 y.o. and 15 y.o., number of phone calls from 0 to 11]

a What is the variable in this investigation?
b Explain why the data is discrete numerical data.
c What percentage of each age group did not make any phone calls?
d What percentage of each age group made 5 or more phone calls?
e Describe and compare the distributions of the sets of data.
f How would you describe the data value '11' for each set of data?

CONTINUOUS NUMERICAL DATA

The height of 14-year-old children is being investigated. The variable 'height of 14-year-old children' is a continuous numerical variable because the values recorded for the variable could, theoretically, be any value on the number line. They are most likely to fall between 120 and 190 centimetres.

The heights of thirty children are measured in centimetres. The measurements are rounded to one decimal place, and the values are recorded below:

163.0 154.2 152.8 160.5 148.3 149.2 154.7 172.7 171.3 162.5
165.0 160.2 166.2 175.3 143.4 174.6 180.9 162.4 167.3 158.4
159.4 164.5 163.7 183.8 150.8 163.4 181.9 158.3 165.0 156.8

Note that these rounded values are actually discrete. However, when we tally them, we use continuous class intervals as follows:

The smallest height is 143.4 cm and the largest is 183.8 cm so we will use class intervals 140 up to 150 (this does not include 150), 150 up to 160, 160 up to 170, 170 up to 180, 180 up to 190. Note that we choose class intervals of the same width.
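
The grouping of the heights into class intervals can be sketched in Python as follows. This is an illustrative sketch rather than part of the original text: the names `heights` and `class_start` are hypothetical, and each height is assigned to the interval that begins at the multiple of 10 at or below it, so that a height of exactly 150 would fall in 150 up to 160 and not in 140 up to 150.

```python
from collections import Counter

# The thirty recorded heights (cm), as listed in the text.
heights = [
    163.0, 154.2, 152.8, 160.5, 148.3, 149.2, 154.7, 172.7, 171.3, 162.5,
    165.0, 160.2, 166.2, 175.3, 143.4, 174.6, 180.9, 162.4, 167.3, 158.4,
    159.4, 164.5, 163.7, 183.8, 150.8, 163.4, 181.9, 158.3, 165.0, 156.8,
]

WIDTH = 10  # every class interval has the same width


def class_start(h, width=WIDTH):
    """Return the lower end of the class interval that contains h.

    The upper end of each interval is excluded.
    """
    return int(h // width) * width


# Count how many heights fall in each class interval.
tally = Counter(class_start(h) for h in heights)

for start in sorted(tally):
    print(f"{start} up to {start + WIDTH}: {tally[start]}")
```

Floor division is used here so that the rule "140 up to 150 does not include 150" is applied automatically at every class boundary.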
These class intervals are written as 140 - < 150, 150 - < 160, etc. in the frequency table. The final class interval is written as 180 - < 190 which means 180 cm up to a height that is less than 190 cm.

A tally-frequency table for this example is:

Height (cm)    Tally              Frequency
140 - < 150    |||                3
150 - < 160    ||||| |||          8
160 - < 170    ||||| ||||| ||     12
170 - < 180    ||||               4
180 - < 190    |||                3
Total                             30

A histogram is used to display continuous numerical data. This is similar to a barchart but because of the continuous nature of the variable, the 'bars' are joined together. The frequency is represented by the height of the 'bars'.

A histogram for this example is shown opposite:

[Histogram: Heights of a sample of fourteen-year-old children; frequency against height (cm), 140 to 190]

RELATIVE FREQUENCY DISTRIBUTIONS

When we compare two distributions which come from different sample sizes, a relative frequency distribution is used for each of them. Relative frequency tables show the proportion (or percentage) for each class.

A relative frequency table and histogram can be drawn for the 'height of 14-year-olds' data.

Height (cm)    Frequency   Relative %
140 - < 150    3           \(\frac{3}{30} \times 100 = 10\%\)
150 - < 160    8           26.7%
160 - < 170    12          40%
170 - < 180    4           13.3%
180 - < 190    3           10%
Total          30          100%

[Relative frequency histogram: relative frequency % (0 to 40) against height (cm), 140 to 190]

From the tables and graphs we can see:
• More children had a height in the class interval 160 up to 170 cm than any other class interval. This class interval is called the modal class. \(\frac{12}{30} \times 100 = 40\%\) of the children had a height in this class.
• Three of the children (\(\frac{3}{30} \times 100 = 10\%\)) had a height less than 150 cm.
• Three of the children (10%) were 180 cm or more tall.
• The distribution of heights was approximately symmetrical.

EXERCISE 8B.2

1 The speeds of cars and trucks travelling along a section of highway have been recorded separately and displayed using the histograms below.

[Histograms: number of cars and number of trucks (0 to 200) against speed (km/h), 50 to 130]

a How many vehicles were included in each survey?
b Compare the percentage of cars and trucks that were travelling at speeds equal to or greater than 100 km/h.
c Compare the percentage of the cars and trucks that were travelling at a speed less than 80 km/h.
d If the owners of the vehicles travelling at 110 km/h or more were fined $165 each, what amount would be collected in fines?
e Compare the shapes of the two histograms.

2 The daily maximum temperature (°C) to the nearest degree, in Adelaide and Hobart, for each day in January 2006, is recorded below:

Adelaide:  34 38 31 38 23 24 25 26 29 35 41 23 32 36 22 21
           24 26 35 36 25 32 27 30 34 30 27 25 26 23 21
Hobart:    29 31 25 23 18 24 19 20 21 22 28 25 22 17 25 22
           22 25 28 16 17 19 24 26 26 27 23 22 18 20 18

a Using class intervals of 5 degrees construct a tally and frequency table for each city.
b Construct histograms to display the data.
c Compare the distribution of Adelaide's daily maximum temperatures in January 2006 with Hobart's.

3 The height of each member of a basketball club has been measured and the results are displayed using the frequency table alongside.
a Calculate the relative frequencies and construct a relative frequency histogram for each sex.
b Compare the distributions of the heights.
c Find the percentage of members of each sex whose height is:
i greater than 180 cm   ii less than 170 cm   iii between 175 and 190 cm.
Height (cm)    Male Frequency   Female Frequency
165 - < 170    1                1
170 - < 175    3                2
175 - < 180    5                12
180 - < 185    12               8
185 - < 190    7                6
190 - < 195    5                2
195 - < 200    2                1
200 - < 205    1                0

C STEMPLOTS

Constructing a stem-and-leaf plot, commonly called a stemplot, is often a convenient method to organise and display a set of numerical data. A stemplot groups the data and shows the relative frequencies but has the added advantage of retaining the actual data values.

CONSTRUCTING A STEMPLOT

Data values such as 25 36 38 49 23 46 47 15 28 38 34 are all two digit numbers, so the first digit will be the 'stem' and the last digit the 'leaf' for each of the numbers. The stems will be 1, 2, 3, 4 to allow for numbers from 10 to 49.

The stemplot for the data is shown alongside.

Stem | Leaf
1    | 5
2    | 3 5 8
3    | 4 6 8 8
4    | 6 7 9
2 | 3 means 23

Notice that:
• 1 | 5 represents 15
• 2 | 3 5 8 represents 23, 25 and 28
• the data in the leaves is evenly spaced with no commas
• the leaves are placed in increasing order, so this stemplot is ordered
• the scale (sometimes called the key) tells us the place value of each leaf. If the scale was 2 | 3 means 2.3, then 4 | 6 7 9 would represent 4.6, 4.7 and 4.9.

If the stems are written with the least number at the top then the stemplot can be rotated so that the values on the horizontal axis are in ascending order and you can see the shape of the distribution.

For data values such as 195 199 207 183 201 .... the first two digits are the stem and the last digit is the leaf.

Example 4

The score, out of 50, on a test was recorded for 36 students.

25 36 38 49 23 46 47 15 28 38 34 9
30 24 27 27 42 16 28 31 24 46 25 31
37 35 32 39 43 40 50 47 29 36 35 33

a Organise the data using a stemplot.
b Comment on the distribution of the data.
a Recording the data from the list gives an unordered stemplot:

Stem | Leaf
0    | 9
1    | 5 6
2    | 5 3 8 4 7 7 8 4 5 9
3    | 6 8 8 4 0 1 1 7 5 2 9 6 5 3
4    | 9 6 7 2 6 3 0 7
5    | 0

Ordering the data from smallest to largest produces an ordered stemplot:

Stem | Leaf
0    | 9
1    | 5 6
2    | 3 4 4 5 5 7 7 8 8 9
3    | 0 1 1 2 3 4 5 5 6 6 7 8 8 9
4    | 0 2 3 6 6 7 7 9
5    | 0
2 | 4 means 24 marks

b The shape of the distribution can be seen when the stemplot is rotated:

[Figure: the ordered stemplot rotated so that the stems run along the horizontal axis]

The data is slightly negatively skewed.

We also observe these important features:
• The minimum (smallest) test score is 9.
• The maximum (largest) test score is 50.

SPLIT STEMS

Consider the following example:

The residue that results when a cigarette is smoked collects in the filter. This residue has been weighed for twenty cigarettes, giving the following data, in mg.

1.62 1.55 1.59 1.56 1.56 1.55 1.63 1.59 1.56 1.69
1.61 1.57 1.56 1.55 1.62 1.61 1.52 1.58 1.63 1.58

Scanning the data reveals that there will be only two 'stems', i.e., 15 and 16. In cases like this we will need to split the stems.

If we use the stem 15 to represent data with values 1.50 to 1.54 and 15* to represent data with values 1.55 to 1.59 etc., we can construct a stemplot with four stems:

Stem | Leaf
15   | 2
15*  | 5 5 5 6 6 6 6 7 8 8 9 9
16   | 1 1 2 2 3 3
16*  | 9
15 | 2 means 1.52

If we split the stems five ways, where 150 represents data with values 1.50 and 1.51, 152 represents data with values 1.52 and 1.53 etc., the stemplot becomes:

Stem | Leaf
150  |
152  | 2
154  | 5 5 5
156  | 6 6 6 6 7
158  | 8 8 9 9
160  | 1 1
162  | 2 2 3 3
164  |
166  |
168  | 9

The stemplot with the stems split five ways clearly gives a better view of the distribution of the data. The value 1.69 appears as an outlier in this graph. The stemplot with the stems split two ways was not sensitive enough to show this.

BACK-TO-BACK STEMPLOTS

A back-to-back stemplot is a visual display that enables easy analysis and comparison of two sets of data. Consider this example:

An office worker has the choice of travelling to work by tram or train. He has recorded the travel times from recent journeys on both of these types of transport. He wishes to know which type of transport is quicker and which is the more reliable.

Recent tram journey times (minutes): 21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24
Recent train journey times (minutes): 23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16

A back-to-back stemplot could be used to display the relationship between the categorical variable type of transport, which has two categories (or levels), and the numerical variable travel time. The type of transport is the independent variable and the travel time is the dependent variable, because the travel time depends on the type of transport.

A back-to-back stemplot is constructed with only one stem. The leaves are grouped on either side of this central stem. The ordered back-to-back stemplot for the data is shown alongside:

     Train leaf | Stem | Tram leaf
8 8 8 7 7 6 6 6 | 1    | 3 4 8 8 9
    8 3 1 1 0 0 | 2    | 1 2 2 4 5 7 8
              0 | 3    | 0 3
                | 4    | 3

The most frequently occurring travel times by train were between 10 and 20 minutes whereas the most frequently occurring travel times by tram were between 20 and 30 minutes. It seems that travel by train is generally quicker and that its travel times are more reliable.

EXERCISE 8C

1 The heights (to the nearest centimetre) of Year 10 boys and girls in a school are being investigated.
The sample data are as follows:

Boys:
164 168 175 169 172 171 171 180 168 168 166 168
170 165 171 173 187 179 181 175 174 165 167 163
160 169 167 172 174 177 188 177 185 167 160

Girls:
165 170 158 166 168 163 170 171 177 169 168 165
156 159 165 164 154 170 171 172 166 152 169 170
163 162 165 163 168 155 175 176 170 166

a Construct a back-to-back stemplot for the data.
b Compare and comment on the distributions of the data, mentioning the shape.
c What percentage of each sex are 175 cm or taller?

2 A new cancer drug is being developed and is being tested on rats. Two groups of twenty rats with cancer were formed; one group was given the drug while the other was not. The survival time of each rat in the experiment was recorded up to a maximum of 192 days.

Survival times of rats that were given the drug:
64 78 106 106 106 127 127 134 148 186
192* 192* 192* 192* 192* 192* 64 78 106 106

Survival times of rats that were not given the drug:
37 38 42 43 43 43 43 43 48 49
51 51 55 57 59 62 66 69 86 37

* denotes that the rat was still alive at the end of the experiment

a Construct a back-to-back stemplot for the data.
b Compare and comment on the distributions of the data, mentioning the shape.
c What percentage of each group of rats survived for 70 days or more?

3 Peter and John are competing taxi-drivers who wish to know who earns more money. They have recorded the amount of money (in dollars) collected per hour for five hours over five days:

Peter:
17.3 11.3 15.7 18.9 9.6 13 19.1 18.3 22.8 16.7 11.7 15.8 12.8
24 15 13 12.3 21.1 18.6 18.9 13.9 11.7 15.5 15.2 18.6

John:
23.7 10.1 8.8 13.3 12.2 11.1 12.2 13.5 12.3 14.2 18.6 18.9 15.7
13.3 20.1 14 12.7 13.8 10.1 13.5 14.6 13.3 13.4 13.6 14.2

a Construct a back-to-back stemplot for the data.
b Compare and comment on the distributions of the data, mentioning the shape, and any outliers.
c Who seems to collect more money per hour?

4 The residue that results when a cigarette is smoked collects in the filter.
The residue from twenty cigarettes from each of two different brands was measured, giving the following data, in milligrams:

Brand X:
1.62 1.55 1.59 1.56 1.56 1.55 1.63 1.59 1.56 1.69
1.61 1.57 1.56 1.55 1.62 1.61 1.52 1.58 1.63 1.58

Brand Y:
1.61 1.62 1.69 1.62 1.60 1.59 1.66 1.55 1.61 1.62
1.64 1.61 1.58 1.57 1.57 1.57 1.58 1.60 1.63 1.59

a Copy and complete the back-to-back stemplot for this data:

Brand Y | Stem | Brand X
        | 150  |
        | 152  | 2
        | 154  | 5 5 5
        | 156  | 6 6 6 6 7
        | 158  | 8 8 9 9
        | 160  | 1 1
        | 162  | 2 2 3 3
        | 164  |
        | 166  |
        | 168  | 9
156 includes values 1.56 and 1.57

b Comment on and compare the shape of the distributions.

D MEASURES OF CENTRE

A picture of a data set can be obtained if we have an indication of the centre of the data and the spread of the data.

Two statistics that provide a measure of the centre of a set of data are:
• the mean
• the median.

THE MEAN

How a class performs in a mathematics test is quickly and probably best described by quoting the arithmetic mean (often called the average) of the distribution of marks.

The mean of n numbers is obtained by summing the numbers and then dividing by n.

For the numbers \(x_1, x_2, x_3, x_4, \dots, x_n\), the mean is

\[ \bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + \dots + x_n}{n}. \]

Example 5

The results of a biology test (out of 50) are given below:

44 7 30 40 22 32 39 13 38 35 31 36 29 34
27 39 37 16 35 41 35 45 20 32 23 38 48 46

Find the mean of the test results.

\[ \text{Mean, } \bar{x} = \frac{44 + 7 + 30 + \dots + 38 + 46}{28} = \frac{912}{28} \approx 32.6 \]

Note:
• The mean involves all the data values.
If you are told that the mean mark for a test is 65% then there will be some marks higher than 65% and some marks lower than 65%.
• The mean does not have to be one of the data values. For example: The mean number of children per family is 1.8 in Adelaide. It is obvious that a family cannot have 1.8 children but this statistic tells us that most families have either 1 or 2 children, with more families having 2 children.

THE MEDIAN

When a set of data is written in order, the median is the middle value of the set.

For the biology test results, the ordered data set is:

7 13 16 20 22 23 27 29 30 31 32 32 34 35
35 35 36 37 38 38 39 39 40 41 44 45 46 48

There are two middle scores, so the median score = 35 (their average).

Note: For a sample of size n, the median is the \(\left(\frac{n+1}{2}\right)\)th score.
If n is odd, say 17, the median is the \(\frac{17+1}{2} = 9\)th score.
If n is even, say 18, the median is the \(\frac{18+1}{2} = 9.5\)th score, indicating the average of the 9th and 10th scores.

Example 6

Find the median for the following data sets:
a 5 5 7 3 8 2 3 4 6 5 7 6 4
b 3 5 5 5 5 6 6 6 7 7 7 8 8 8 9 10

a The data set is ordered (arranged from smallest to largest).
2 3 3 4 4 5 5 5 6 6 7 7 8
The median is the \(\frac{13+1}{2} = 7\)th value.
The median is 5.

b There are 16 data values so the median is the average of the 8th and 9th values.
3 5 5 5 5 6 6 6 7 7 7 8 8 8 9 10
The median is \(\frac{6+7}{2} = 6.5\) (Note: This is not one of the data values.)

Note on symmetry

[Figure: symmetric distribution with the mean and median at the centre]

This distribution is symmetric. Data values are symmetrically spread about the centre. For a symmetrical distribution the mean and median are equal (or approximately equal).
[Figure: negatively skewed distribution, with the mean to the left of the median]

This distribution is negatively skewed (or skewed left) and the mean < the median.

[Figure: positively skewed distribution, with the mean to the right of the median]

This distribution is positively skewed (or skewed right) and the mean > the median.

FINDING THE MEAN AND MEDIAN OF UNGROUPED DATA

Consider the data 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9.

For TI-83

Data is entered in the STAT EDIT menu. Press STAT 1 to select 1:Edit.
In L1, delete all existing data. Enter the new data. Press 2 ENTER then 3 ENTER etc, until all data is entered.
To obtain the descriptive statistics, press STAT and move across to select the STAT CALC menu. Press 1 to select 1:1-Var Stats.
Pressing 2nd 1 (L1) ENTER gives the mean \(\bar{x} = 4.87\) (to 3 sf).
Scrolling down repeatedly gives the median = 5.

For Casio

From the Main Menu, select STAT. In List 1, delete all existing data and enter the new data. Press 2 EXE then 3 EXE etc until all data is entered.
To obtain the descriptive statistics, press F6 if the GRPH icon is not in the bottom left corner of the screen.
Press F2 (CALC) F1 (1VAR), which gives the mean \(\bar{x} = 4.87\) (to 3 sf).
Scrolling down repeatedly gives the median = 5.

MEAN AND MEDIAN FOR GROUPED DISCRETE DATA

Example 7

The frequency table alongside shows data collected from a random sample of 50 households in a particular suburb, investigating the number of people in the household. Use the calculator to find the mean and median of the number of people in a household for this sample.
Number of people in the household   Frequency
1                                   5
2                                   8
3                                   13
4                                   14
5                                   7
6                                   3

For TI-83

Press STAT 1 to select 1:Edit. Key the variable values into L1 and the frequency values into L2.
Press STAT and select 1:1-Var Stats from the STAT CALC menu.
Enter L1, L2 by pressing 2nd 1 (L1) , 2nd 2 (L2) ENTER.
The mean is 3.38. Scroll down, and the median is 3.
Note: If you do not include L2 you will get a screen of statistics for L1 only.

For Casio

From the Main Menu, select STAT. Key the variable values into List 1 and the frequency values into List 2.
Press F6 if the GRPH icon is not in the bottom left corner of the screen.
Press F2 (CALC) F6 (SET), then F3 (List2) to change the frequency variable to List 2.
Press EXIT F1 (1VAR).
The mean is 3.38. Scroll down, and the median is 3.

MEAN AND MEDIAN FOR GROUPED CONTINUOUS DATA

If continuous data is grouped using class intervals, we use the midpoints of the class intervals as the variable values.

Example 8

The time taken by students to complete a mid-year examination in Economics for all students participating is given in the table following (in minutes).

Time          Students
60 - < 70     1
70 - < 80     2
80 - < 90     11
90 - < 100    24
100 - < 110   28
110 - < 120   13

a What do you suspect was the duration of the examination paper?
b What are the midpoints (x) of the time intervals?
c Calculate the mean and median time to complete the exam.

a The exam paper was for 120 minutes (2 hours).
b The midpoints are 65, 75, 85, 95, 105, 115 (minutes).
c For TI-83: We enter the midpoints into L1 and the frequencies into L2.
For Casio: We enter the midpoints into List 1 and the frequencies into List 2.
In both cases we then proceed using the instructions as in Example 7.
The mean is 99.6 minutes. The median is 105 minutes.
Note: The median is given here as one of the midpoints entered. Why?

EXERCISE 8D

1 Consider the following two data sets:
Data set A: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 10
Data set B: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 15
a Find the mean for both Data set A and Data set B.
b Find the median of both Data set A and Data set B.
c Explain why the mean of Data set A is less than the mean of Data set B.
d Explain why the median of Data set A is the same as the median of Data set B.

2 The back-to-back stemplot below shows the points per game scored by two basketballers, Erin and Tracy:

Stems: 0 1 2 3 4
Erin leaves, by row from stem 0 to stem 4:   9 | 8 7 5 | 7 6 4 1 1 | 8 4 2 0 | 1
Tracy leaves, by row in stem order:          4 4 7 8 | 0 1 2 6 8 9 | 3 5 9 | 1 1
3 | 1 represents 31 points

a Find the mean for each player.
b Find the median for each player.
c Why is the median for Erin not one of the points per game listed?
d Which player generally scored more points per game?

3 The frequency table alongside records the number of phone calls made in a day by 50 thirteen-year-olds and 50 eighteen-year-olds.

Number of phone calls   13 year old Frequency   18 year old Frequency
0                       5                       1
1                       8                       2
2                       13                      3
3                       8                       4
4                       6                       4
5                       3                       6
6                       3                       8
7                       2                       7
8                       1                       5
9                       0                       4
10                      0                       3
11                      1                       3

a For both sets of data, find the: i mean ii median.
b Why is the mean larger than the median for the thirteen-year-old data?
c Why are the mean and median approximately equal for the eighteen-year-old data?

4 The weights of a squad of AFL players are compared with those of NRL players.

Weight (kg)    Number of AFL players   Number of NRL players
70 - < 80      8                       0
80 - < 90      10                      3
90 - < 100     12                      9
100 - < 110    3                       11
110 - < 120    2                       7

a How many players were weighed in each squad?
b Calculate the mean weight for players in each squad.
c Find the median weight for players in each squad.
d Which squad generally has heavier players?

5 A tennis club has 450 members listed on its database. The population mean of members' ages has been calculated at 28.3. The marketing department wants to survey members on their preferred social activities at the club. The age breakdowns of members at the club are:

Age range     No. of members
under 18      62
18 - < 30     211
30 - < 50     103
over 50       74

a How many of each age range should be surveyed if a stratified sample of 20 members is used?

The marketing department noted the ages of the 20 members surveyed. The results were:
10, 72, 25, 44, 52, 15, 17, 62, 60, 32, 19, 23, 48, 37, 21, 27, 35, 25, 26, 29

b Find the sample mean age of the members surveyed.
c Why is this different to the mean age of the member population?
d Suggest how the sample mean age could better reflect the population mean age.
e The marketing department decided to conduct another survey of 40 members. Discuss the reliability of the sample mean age of this sample in comparison to the sample mean age of the first sample of 20 members.

6 A school has 820 families listed on the enrolment database.
The income ranges of the families are listed below:

Income range                Number of families
$0 - < $30 000              56
$30 000 - < $60 000         214
$60 000 - < $90 000         445
$90 000 - < $120 000        73
$120 000 - < $150 000       32

a Calculate an estimate of the mean income of all families at the school.
The school bursar wanted to survey a sample of 30 families to determine their reaction to an increase in school fees. She selected 30 families at random for this purpose and at the same time she asked them to record their income for the last year. The results were:
$45 000   $54 000   $38 000   $85 000   $75 000   $21 000   $123 000   $47 000   $121 000   $29 000
$145 000   $95 000   $52 000   $46 000   $55 000   $132 000   $112 000   $63 000   $134 000   $115 000
$94 000   $127 000   $78 000   $102 000   $89 000   $72 000   $29 000   $83 000   $62 000   $54 000
b Calculate the mean income of the sample of 30 families.
c Why is this different from the estimate of the population mean calculated in a?
d If a stratified sample of 30 families was used, calculate the number of families in the $120 000 - < $150 000 income range that should be surveyed.
e How many families in the $120 000 - < $150 000 income range were actually surveyed?
f Discuss the reliability of a stratified sample of 30 families as compared to a simple random sample of 30 families as done in a.
g How could an even more reliable sample be obtained?

CHOOSING THE APPROPRIATE MEASURE OF THE CENTRE

The mean and median can both be used to indicate the centre of a set of numbers. Which of them is the more appropriate measure depends on the type of data under consideration.
In real estate values the median is used as a measure of the centre. Why is this?
When selecting which of the measures of central tendency to use as a representative figure for a set of data, you should keep the following advantages and disadvantages of each measure in mind.
Mean
• The mean's main advantage is that it is commonly used, easy to understand and easy to calculate.
• Its main disadvantage is that it is affected by extreme values within a set of data and so may give a distorted impression of the data.
For example, consider the following data: 4, 6, 7, 8, 19, 111. The total of these 6 numbers is 155, and so the mean is approximately 25.8. Is 25.8 a representative figure for the data? The extreme value (or outlier) of 111 has distorted the mean in this case.

Median
• The median's main advantage is that it is easily calculated and is the middle value of the data.
• Unlike the mean, it is not affected by extreme values.
• The main disadvantage is that it ignores all values outside the middle range and so its representativeness is questionable.

Note: Because the mean is unable to resist the influence of extreme values it is a non-resistant measure of the centre. The median is a resistant measure.

E THE VARIABILITY (SPREAD) OF A DISTRIBUTION

We use two measures to describe a distribution. These are its centre and its variability (or spread).
The distributions shown have the same mean, but clearly they have a different spread. For example, the A distribution has most scores close to the mean whereas the C distribution has greater spread.
[Figure: three distributions A, B and C with the same mean but different spreads]

Three commonly used statistics that indicate the spread of a set of data are:
• the range
• the interquartile range
• the standard deviation.

THE RANGE AND INTERQUARTILE RANGE

The range is the difference between the maximum (largest) data value and the minimum (smallest) data value.
Range = maximum data value − minimum data value

Example 9
A greengrocer chain is to purchase apples from two different wholesalers. They take six random samples of 50 apples to examine them for skin blemishes. The counts for the number of blemished apples are:
Wholesaler Redapp: 5 17 15 3 9 11
Wholesaler Pureapp: 10 13 12 11 12 11
What is the range of blemished apples from each wholesaler?

Wholesaler Redapp: Range = 17 − 3 = 14
Wholesaler Pureapp: Range = 13 − 10 = 3

Note: This shows that Wholesaler Redapp has more variability in the number of skin blemished apples per sample of 50.
Note: The range is not considered to be a particularly reliable or resistant measure of spread as it uses only two data values.

THE INTERQUARTILE RANGE (Review)

The median divides the ordered data set into two halves and these halves are divided in half again by the quartiles.
The middle value of the lower half is called the lower quartile (Q1). One-quarter, or 25%, of the data have a value less than or equal to the lower quartile. 75% of the data have values greater than or equal to the lower quartile.
The middle value of the upper half is called the upper quartile (Q3). One-quarter, or 25%, of the data have a value greater than or equal to the upper quartile. 75% of the data have values less than or equal to the upper quartile.

Interquartile range = upper quartile − lower quartile

The interquartile range is the range of the middle 50% of the data.
The data set has been divided into quarters by the lower quartile (Q1), the median (Q2) and the upper quartile (Q3).
So, the interquartile range, IQR = Q3 − Q1.

Example 10
For the data set 5 5 7 3 8 2 3 4 6 5 7 6 4 find the:
a median b lower quartile c upper quartile d interquartile range

The ordered data set is 2 3 3 4 4 5 5 5 6 6 7 7 8
a There are 13 data values so the median is the 7th value, which is 5.
b There is an odd number of data and the median is one of the values, so it divides the data into two halves of six values each:
2 3 3 4 4 5 | 5 | 5 6 6 7 7 8
The middle value of the lower half is the average of the 3rd and 4th values.
Lower quartile = (3 + 4)/2 = 3.5
Note: For an odd number of data the median data value is not included in the lower or upper half for the calculation of the quartiles.
c Similarly, the middle value of the upper half is the average of the 10th and 11th values.
Upper quartile = (6 + 7)/2 = 6.5
d Interquartile range = upper quartile − lower quartile = 6.5 − 3.5 = 3
So, the middle half of the data has a spread of 3.

A summary for the set of data in Example 10 is:
Range = 8 − 2 = 6
Lower quartile = 3.5, median = 5, upper quartile = 6.5, interquartile range = 3
The data has a spread of 6 (range = 6), centred around the value 5 (median = 5). The middle half of the data has a spread of 3 (interquartile range = 3).

Although they give useful information, the range and the interquartile range are not as useful as the standard deviation as a measure of spread. The range uses only two data values and the interquartile range ignores the lowest and highest quarters of the data.
The standard deviation is an average variation from the mean of all data values.
For a set of n data values of x: $x_1, x_2, x_3, x_4, \ldots, x_n$, then:

$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ is the standard deviation for a sample with mean $\bar{x}$.

Example 11
Find the standard deviations for the apple samples of Example 9.

Wholesaler Redapp

$x$       $x - \bar{x}$     $(x - \bar{x})^2$
5         −5                25
17        7                 49
15        5                 25
3         −7                49
9         −1                1
11        1                 1
Total 60                    150

$\bar{x} = \dfrac{60}{6} = 10$ and $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{150}{5}} \approx 5.48$

Wholesaler Pureapp

$x$       $x - \bar{x}$     $(x - \bar{x})^2$
10        −1.5              2.25
13        1.5               2.25
12        0.5               0.25
11        −0.5              0.25
12        0.5               0.25
11        −0.5              0.25
Total 69                    5.5

$\bar{x} = \dfrac{69}{6} = 11.5$ and $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{5.5}{5}} \approx 1.05$

Clearly, Wholesaler Pureapp supplied apples with more blemishes but with less variability (smaller standard deviation) than those supplied by Redapp.

Note: The formula and example above are included for completeness and to give you an idea of how the standard deviation is calculated. In this course, you should concentrate on using technology to find the standard deviation.

USING THE CALCULATOR TO FIND THE MEASURES OF SPREAD

We will concentrate on using technology to find the measures of spread.

Example 12
Find the three measures of spread for the number of goals thrown by a netballer in 18 games:
8, 4, 3, 9, 6, 5, 5, 10, 3, 6, 7, 9, 11, 14, 9, 8, 7, 12

We key the data into a list. The data does not have to be ordered.
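For readers working on a computer instead of a graphics calculator, the same three measures can be sketched in Python. This is an illustrative sketch only: the function names `quartiles` and `spread_summary` are made up for this example, the quartiles follow the median-of-halves method described in this chapter, and `statistics.stdev` is used because it divides by n − 1 as in the formula above.

```python
import statistics

# Goals thrown by the netballer in 18 games (Example 12)
goals = [8, 4, 3, 9, 6, 5, 5, 10, 3, 6, 7, 9, 11, 14, 9, 8, 7, 12]


def quartiles(data):
    """Return (Q1, Q3) using the median-of-halves method of this chapter.

    For an odd number of values the median itself is left out of both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return statistics.median(lower), statistics.median(upper)


def spread_summary(data):
    """Return the range, interquartile range and sample standard deviation."""
    q1, q3 = quartiles(data)
    return {
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "sd": statistics.stdev(data),  # sample standard deviation (n - 1)
    }


summary = spread_summary(goals)
print(summary["range"], summary["iqr"], round(summary["sd"], 2))  # 11 4 3.05
```

The list is sorted inside `quartiles`, so, as with the calculator, the data does not have to be entered in order.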
For both the TI-83 and the Casio we obtain:
The range = maxX − minX = 14 − 3 = 11
The standard deviation is 3.05
The IQR = Q3 − Q1 = 9 − 5 = 4

EXERCISE 8E

1 Netballers Sally and Joanne compare their goal throwing scores for the last 8 matches.
Goals by Sally: 23 17 31 25 25 19 28 32
Goals by Joanne: 9 29 41 26 14 44 38 43
a Find the mean and standard deviation for the number of goals thrown by each goal shooter for these matches.
b Which measure is used to determine which of the goal shooters is more consistent?

2 Two cricketers compare their bowling performances for the last ten test matches. The number of wickets per match was:
Glen: 0 10 1 9 11 0 8 5 6 7
Shane: 4 3 4 1 4 11 7 6 12 5
a Show that each bowler has the same mean and range.
b Which performance do you suspect is more variable, Glen's bowling over the period or Shane's?
c Check your answer to b by finding the IQR and standard deviation for each distribution.
d Does the range, IQR or the standard deviation give a better indication of variability?

3 A manufacturer of soft drinks employs a statistician for quality control. Suppose that he needs to check that 375 mL of drink goes into each can. The machine which fills the cans may malfunction or slightly change its delivery due to constant vibration or other factors.
a Would you expect the standard deviation for the whole production run to be the same for one day as it is for one week? Explain.
b If samples of 125 cans are taken each day, what measure would be used to:
i check that 375 mL of drink goes into each can
ii check the variability of the volume of drink going into each can?
c What is the significance of a low standard deviation in this case?

4 Two groups of students are given pairs of shoes to wear to school. The first group (X) have original rubber soled shoes and the second group (Y) have the new synthetic rubber soled shoes. The data below shows the thickness of the soles of the shoes after six months.
Group X: 3, 5, 6, 4, 5, 6, 2, 7, 3, 4, 4, 6, 5, 5, 5, 7, 6, 4, 4, 3, 6, 5, 4, 2
Group Y: 4, 6, 5, 4, 3, 5, 6, 6, 7, 6, 6, 4, 5, 7, 8, 6, 7, 5, 3, 6, 6, 7, 5
a Find the median, lower and upper quartiles for each distribution.
b Find the range and IQR of each distribution.
c Is the new synthetic rubber on the soles of shoes an improvement? Give evidence.

DISCUSSION

Consider the range, interquartile range and standard deviation. Which of these measures is resistant and which is non-resistant as a measure of spread?

F BOX AND WHISKER PLOTS

A box and whisker plot (or simply a boxplot) is a visual display of some of the descriptive statistics of a data set. It shows:
• the minimum value (minX)
• the lower quartile (Q1)
• the median (med)
• the upper quartile (Q3)
• the maximum value (maxX)
These five numbers form the five-number summary of a data set.

CONSTRUCTING A BOXPLOT

A boxplot (box-and-whisker plot) is constructed above a number line (labelled and scaled) which is drawn so that it covers all the data values in the data set.
The boxplot is drawn with a rectangular 'box' representing the middle half of the data. The 'box' goes from the lower quartile to the upper quartile. The 'whiskers' extend from the 'box' to the maximum value and to the minimum value. A vertical line marks the position of the median in the 'box'.
For example, for the data set 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9:
The ordered data set is 1, 2, 3, 3, 3, 4, 5 | 5 | 5, 5, 6, 7, 7, 8, 9 (15 data), with 7 values on each side of the median.
• The minimum is 1.
• The lower quartile is the 4th value, 3.
• The median is the 8th value, 5.
• The upper quartile is the 12th value, 7.
• The maximum is 9.
These 5 statistics form the five-number summary.
[Figure: boxplot drawn above a number line from 1 to 9, with the whiskers, minimum, lower quartile, median, upper quartile and maximum labelled]

USING A GRAPHICS CALCULATOR TO CONSTRUCT A BOXPLOT

Consider the data: 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9

For a TI-83
Press STAT 1 to select 1:Edit. Enter the data from the example above into L1.
Statistical graphs are drawn using STAT PLOT, which is located above the Y= key. Press 2nd Y= to use it.
Press ENTER to use Plot1. Turn the plot On by pressing ENTER, then use the arrow keys to choose the boxplot icon and press ENTER.
Press ZOOM 9 to select 9:ZoomStat and draw the boxplot.
TRACE can be used to locate the statistics of the five-number summary. The arrow keys move backwards and forwards between them. In this screen, the cursor is on the median.

For a Casio
From the Main Menu, select STAT. Enter the data from the example above into List1.
Press F6 (▷) until the GRPH icon is in the bottom left corner of the screen.
Press F1 (GRPH) F6 (SET), then F6 (▷) F2 (BOX) to choose the boxplot.
Check that the XList variable is set to List1, then press EXIT F4 (SEL) F1 (On) to turn StatGraph1 on.
Press F6 (Draw) to draw the boxplot.
Pressing F1 (1VAR) gives the statistics of the data, including the five-number summary.
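The five-number summary that TRACE or 1VAR reports can also be checked with a short Python sketch. The function name `five_number_summary` is hypothetical, and the quartiles are found with the median-of-halves method used in this chapter, so other software that uses a different quartile convention may give slightly different values.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers.

    Quartiles are the medians of the lower and upper halves; for an odd
    number of values the median itself belongs to neither half.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return (ordered[0], q1, statistics.median(ordered), q3, ordered[-1])


data = [2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9]
result = five_number_summary(data)
print(result)  # (1, 3, 5, 7, 9)
```

These five values are exactly the positions of the left whisker end, the left edge of the box, the median line, the right edge of the box and the right whisker end.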
INTERPRETING A BOXPLOT

A set of data with a symmetric distribution will have a symmetric boxplot.
For example:
[Figure: histogram and boxplot of a symmetric distribution, x from 10 to 20]
The whiskers of the boxplot are the same length and the median line is in the centre of the box.

A set of data which is positively skewed will have a positively skewed boxplot.
For example:
[Figure: histogram and boxplot of a positively skewed distribution, x from 1 to 8]
The right whisker is longer than the left whisker and the median line is towards the left of the box.

A set of data which is negatively skewed will have a boxplot that appears stretched to the left.
For example:
[Figure: column graph and boxplot of a negatively skewed distribution, x from 1 to 9]
The left whisker is longer than the right and the median line is towards the right of the box.

Example 13
[Figure: parallel boxplots of elephant ages for females (n = 34) and males (n = 26); axis: age (years), 0 to 45]
A conservation park in Sri Lanka is home to 60 elephants, of which 34 are females and 26 are males. The parallel boxplots above show the distribution of their ages by sex.
a What sex was the youngest elephant?
b How old is the oldest female elephant?
c Compare the range of ages by sex and interpret.
d Compare the median age of each sex.
e The youngest 25% of female elephants are aged between ...... and ......
f 75% of male elephants are aged under ......
g Comment on the shape of each distribution.

a The youngest elephant was aged 2 and is male.
b The oldest female elephant is 36½ years old.
c Male: Range = 43 − 2 = 41
Female: Range = 36½ − 4 = 32½
∴ the range of male ages is larger, indicating greater variability.
d The median age is 23 for both males and females.
e The youngest 25% of female elephants are aged between 4 and 14.
f 75% of male elephants are aged under 30½.
g The distribution of male ages is roughly symmetrical, whilst the distribution of female ages is stretched to the left and hence is negatively skewed.

EXERCISE 8F.1

Box and whisker plots are often used to compare data.

1 The following box and whisker plots are for weekly motor vehicle sales at two large car yards owned by the same business.
[Figure: boxplots for Yard A and Yard B, each on a scale from 0 to 12]
a Find the range of each distribution.
b Find the median of each distribution.
c Find the interquartile range of each distribution.
d Do the boxplots enable you to deduce the more effective sales yard?

2 The given side by side box and whisker plots compare the results of a science test and a re-test of the same topic.
[Figure: side by side boxplots for Test A and Re-test B on a scale from 3 to 9]
a Write a brief account comparing the medians, ranges and IQRs.
b Do you think that the group of students have improved their understanding of the topic due to the re-test?

3 A large hardware chain is examining 50 mm diameter PVC pipe from three different manufacturers. When the data is analysed, boxplots are constructed of measurements of the internal diameters of randomly selected pipes. The boxplots are shown below.
[Figure: boxplots for manufacturers A, B and C on a scale from 49.8 to 50.3]
Which manufacturer should the hardware chain use if:
a they want a consistent diameter (small variability)
b they want the largest diameter
c they want a diameter as close as possible to 50 mm?

4 The boxplots alongside compare the time students in years 10 and 12 spend on homework over a one week period.
[Figure: parallel boxplots for Year 10 and Year 12; axis: hours, 0 to 20]
a Find the 5-number summaries for both the year 10 and year 12 students.
b Determine the i range ii interquartile range for each group.

5 Two classes have completed the same test. Boxplots have been drawn to summarise and display the results. They have been drawn on the same set of axes so that the results can be compared.
[Figure: vertical boxplots for Class A and Class B; axis: test score, 30 to 90]
a In which class was:
i the highest mark ii the lowest mark iii there a larger spread of marks?
b Find:
i the range of marks in class B ii the interquartile range for class A.
c If the pass mark was 50 for the test, what percentage of students passed the test in:
i class A ii class B?
d Describe the distribution of marks in: i class A ii class B.
e Copy and complete: The students in class ....... generally scored higher marks. The marks in class ...... were more varied.

TESTING FOR OUTLIERS

Outliers are extraordinary data that are either much larger or much smaller than the main body of data. One commonly used test for outliers involves the following calculation of 'boundaries':
The upper boundary = upper quartile + 1.5 × IQR. Any data larger than this number is an outlier.
The lower boundary = lower quartile − 1.5 × IQR. Any data smaller than this value is an outlier.
When outliers exist, the 'whiskers' of a boxplot extend to the last value that is not an outlier. Outliers are marked by an asterisk. It is possible to have more than one outlier at either end.

Example 14
Use technology to draw a boxplot for the following data, identifying any outliers.
1, 3, 7, 8, 8, 5, 9, 9, 12, 14, 7, 1, 4, 8, 16, 8, 7, 9, 10, 13, 7, 6, 8, 11, 17, 7

We enter the data in L1 (TI-83) or List 1 (Casio).
For TI-83
Use STAT PLOT. Press 2nd Y= . Press ENTER to use Plot1.
Turn the plot On then use the arrow keys to choose the 'boxplot with outliers' icon. Then press ENTER .
Press ZOOM 9 to select 9:ZoomStat and draw the boxplot. Note that only one of the outliers at 1 appears on the screen.
Press TRACE and use the arrow keys to move the cursor through the summary statistics. Note that both values at 1 are included as are 16 and 17.

For Casio
Press F6 (▷) until the GRPH icon is in the bottom left corner of the screen.
Press F1 (GRPH) F6 (SET), and then F6 (▷) F2 (Box) to select boxplot. Also set Outliers to On.
Press EXIT F4 (SEL) F1 (On) to turn StatGraph1 on.
Press F6 (Draw) to draw the boxplot. The data points at 1, 16 and 17 are highlighted as outliers.

We now sketch the boxplot:
[Figure: boxplot with outliers on a scale from 0 to 17. Two outliers of the same value are shown stacked. The whisker is drawn to the last value that is not an outlier.]

CONSTRUCTING PARALLEL BOXPLOTS

A graphics calculator can be used to construct parallel boxplots which can then be interpreted and compared.
Consider the office workers from page 464 who recorded travel times to work by train or tram.
Tram travel times (minutes): 21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24
Train travel times (minutes): 23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16
If, in addition, car travel time data is available to the office worker, we can use parallel boxplots to compare the data. They help us decide which type of transport is the quickest to get him to work and which is the most reliable.
Car travel times (minutes): 30, 21, 19, 17, 24, 28, 23, 25, 25, 16, 18, 19, 29, 22

PARALLEL BOXPLOTS FROM A CALCULATOR

For TI-83
Press STAT 1 to select 1:Edit.
The data for each of the boxplots is entered in a separate list.
Press 2nd Y= to select STAT PLOT. Press ENTER to access Plot1. Make sure that Plot1 is On, the "boxplot with outliers" icon is selected, and the XList is set to L1.
Use the arrow keys to return the cursor to the top of the screen, then press ENTER to access Plot2. Adjust the settings of Plot2 so they match Plot1, except set the XList variable to L2, by pressing 2nd 2 (L2).
Return the cursor to the top of the screen, press the arrow key until Plot3 is highlighted, then press ENTER . Again match the settings of Plot3 with those of Plot1, except set the XList variable to L3.
ZOOM 9 (9:ZoomStat) will bring the graphs to the screen.
TRACE , then the arrows, can be used to find '5-number summary' values on the screen.

For Casio
From the Main Menu, select STAT. The data for each of the boxplots is entered in a separate list.
Press F6 (▷) until the GRPH icon is in the bottom left corner of the screen.
Press F1 (GRPH) F6 (SET) to access StatGraph1. Press F6 (▷) F2 (Box) to select the boxplot, press F1 (List1) to set the XList variable to List1, then set the Outliers to On.
Use the arrow keys to return the cursor to the top of the screen, then press F2 (GPH2) to access StatGraph2. Adjust the settings of StatGraph2 so they match StatGraph1, except set the XList variable to List2.
Return the cursor to the top of the screen, then press F6 (GPH3) to access StatGraph3. Again match the settings of StatGraph3 with those of StatGraph1, except set the XList variable to List3.
Press EXIT F4 (SEL), and make sure all three graphs are set to DrawOn.
Press F6 (Draw) to draw the boxplots.
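The statistics behind the three parallel boxplots, including the outlier boundaries from TESTING FOR OUTLIERS, can be sketched in Python as a cross-check on the calculator screens. The names `quartiles` and `describe` are invented for this illustration, and the quartiles again follow the median-of-halves method of this chapter.

```python
import statistics

# Travel times to work in minutes, as given in the text
travel_times = {
    "tram": [21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24],
    "train": [23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16],
    "car": [30, 21, 19, 17, 24, 28, 23, 25, 25, 16, 18, 19, 29, 22],
}


def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])


def describe(data):
    """Return the median, range, IQR and outliers for one data set."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    lower_boundary = q1 - 1.5 * iqr  # anything smaller is an outlier
    upper_boundary = q3 + 1.5 * iqr  # anything larger is an outlier
    outliers = [x for x in sorted(data)
                if x < lower_boundary or x > upper_boundary]
    return {
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "iqr": iqr,
        "outliers": outliers,
    }


report = {mode: describe(times) for mode, times in travel_times.items()}
for mode, stats in report.items():
    print(mode, stats)
```

Only values strictly beyond a boundary are counted, which matters for the tram data: its largest value, 43, sits exactly on the upper boundary and so is not flagged.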
The three boxplots are drawn on the one axis:
[Figure: parallel boxplots for tram, train and car (a categorical variable with three categories); axis: travel time (minutes), a numerical variable, 10 to 45]

The car travel times have almost the same spread (range = 14 mins, IQR = 6 mins) as the train travel times (range = 14 mins, IQR = 4 mins), suggesting that the car travel time is as reliable as the train travel time. However, the train travel times include two outliers which may be due to extraordinary events. If these are ignored then the range of travel times for the train would be 7 minutes, which is considerably less than the ranges for the car and tram.
The median car travel time is 22.5 minutes, compared to 18 minutes for the train and 22 minutes for the tram, so it is still generally quicker to travel by train.
In conclusion: From the data given, it is generally quicker and more reliable to travel by train than it is by either tram or car.

EXERCISE 8F.2

1 The daily maximum temperatures in Melbourne for June 21st and December 21st (the solstices) are being compared. The data for the 20 years from 1987 to 2006 is given below:
June 21st: 13.6, 10.6, 19.1, 14.2, 12.2, 11.9, 18.3, 14.9, 14.6, 15.1, 17.4, 13.5, 16.7, 14.0, 11.1, 17.0, 15.4, 16.3, 15.6, 36.3
December 21st: 24.2, 19.4, 21.4, 22.7, 21.4, 20.0, 22.3, 21.1, 18.9, 23.5, 21.3, 23.0, 28.1, 20.3, 17.2, 35.0, 33.7, 21.9, 21.4, 38.6
a Construct parallel boxplots for the data.
b Are any outliers able to be identified?
c Copy and complete the given table:

                       June 21st     Dec. 21st
Mean
Median
Range
IQR
Standard deviation

d Compare and comment on the two data sets.
e The outlier of 36.3° on June 21st is clearly a mistake in recording! It should have been 16.3°. Complete the following table:

                       June 21st with 36.3°     June 21st with 16.3°
Mean
Median
Range
IQR
Standard deviation

f Discuss the effect on each of the measures of centre and spread above after the removal of the outlier.

2 The heights (to the nearest centimetre) of boys and girls in a school year are as follows:
Boys: 164 168 175 169 172 171 171 180 168 168 166 168 170 165 171 173 187 179 181 175 174 165 167 163 160 169 167 172 174 177 188 177 185 167 160 123 205
Girls: 165 170 158 166 168 163 170 171 177 169 168 165 156 159 165 164 154 170 171 172 166 152 169 170 163 162 165 163 168 155 175 176 170 166
a Construct parallel boxplots for the data.
b Are there any outliers present?
There are no boys in the year group with a height of 123 cm, but there is one giant of 205 cm! Remove the 123 cm from the set of data.
c Calculate the effect of removing the outlier on the:
i mean ii median iii range iv IQR v standard deviation.
d Compare and comment on the distribution of the two data sets with the outlier removed.

3 Batting averages for Australian and Indian teams for the 2001 test series in India were:
Australia: 109.8, 48.6, 47.0, 33.2, 32.2, 29.8, 24.8, 20.0, 10.8, 10.0, 6.0, 3.4, 1.0
India: 83.83, 56.33, 50.67, 28.83, 27.00, 26.00, 21.00, 20.00, 17.67, 11.33, 10.00, 6.00, 4.00, 4.00, 1.00, 0.00
a Construct parallel boxplots for the data, displaying outliers.
b Compare and comment on the centres and spread of the data sets.
c Should any outliers be discarded and the data be reanalysed?

USING A STATISTICAL COMPUTER PROGRAM

Click on the icon to produce a computer program which will enable you to compare data, obtain statistics and draw graphs of comparison. You can then print it all.
G EXTENDED INVESTIGATIONS

EXERCISE 8G

1 Shane and Brett play in the same cricket team and are fierce but friendly rivals when it comes to bowling. During a season the number of wickets per innings taken by each bowler was recorded as:
Shane: 1 6 2 0 3 4 1 4 2 3 0 3 2 4 3 4 3 3 3 4 2 4 3 2 3 3 0 5 3 5 3 2 4 3 4 3 7 2 4 8 1 3 4 2 3 0 5 3 5 2 4 3 4 0 3 3 0 2 5 1 1 2 2 5
Brett: 3 1 1 4 2 0 0 1
a Is the data discrete or continuous?
b Enter the data into a graphics calculator or statistics package.
c Produce side-by-side boxplots for the data.
d Are there any outliers? Should they be deleted before we start to analyse the data?
e Describe the shape of each distribution.
f Compare the measures of the centre of each distribution.
g Compare the spreads of each distribution.
h What conclusions, if any, can be drawn from the data?

2 A manufacturer of light globes claims that the newly invented type has a life 20% longer than the current globe type. Forty of each globe type are randomly selected and tested. Here are the results to the nearest hour.
Old type: 103 96 113 111 126 100 122 110 84 117 111 87 90 121 99 114 105 121 93 109 87 127 117 131 115 116 82 130 113 95 103 113 104 104 87 118 75 111 108 112
New type: 146 131 132 160 128 119 133 117 139 123 191 117 132 107 141 136 146 142 123 144 133 124 153 129 118 130 134 151 145 131 109 129 109 131 145 125 164 125 133 135
a Is the data discrete or continuous?
b Enter the data into a graphics calculator or statistics package and obtain side-by-side boxplots.
c Are there any outliers? Should they be deleted before we start to analyse the data?
d Compare the measures of centre and spread.
e Use b to describe the shape of each distribution.
f What conclusions, if any, can be drawn from the data?
3 Plant fertilisers come in many different brands, but there are essentially two types: organic and inorganic. A student was interested to discover whether radish plants responded better to organic or inorganic fertiliser. He prepared three identical plots of ground, named plots A, B and C, in his mother's garden, and planted 40 radish seeds in each plot. After planting, each plot was treated in an identical manner, except for the way they were fertilised.
Cost prevented him using a variety of fertilisers, so he chose one organic and one inorganic fertiliser. Plot A received no fertiliser, plot B received the organic fertiliser as prescribed on the packet, and plot C received the inorganic fertiliser as prescribed on the packet.
The student was interested in the weight of the root that forms under the ground. The data below is the weight of the root (measured to the nearest gram) of the individual plants:
Data from plot A: 27 29 9 10 8 36 36 42 32 32 32 30 38 32 30 39 38 50 34 41 39 40 12 14 35 35 42 25 34 22
Data from plot B: 51 54 56 41 50 47 47 46 48 52 34 20 28 45 58 47 58 56 63 66 54 48 48 53 47 29 46 33 34
Data from plot C: 55 76 65 61 67 69 68 64 76 59 56 79 70 65 47 69 70 76 43 70 62 60 58 79 65 75 60 39 50 66 68 68 63 54 61 72 58 77
a Produce parallel boxplots for the data.
b Compare and comment on the distributions of the weights of the root for each plot, mentioning the shape, centre and spread and quoting statistics to support your statements.

INVESTIGATION 2 KARELINE'S REAL ESTATE DATA

Open the spreadsheet on Kareline's real estate data.

What to do:
1 In F2 enter =COUNT(C:C)
In F3 enter =QUARTILE(C:C,0)
In F4 enter =QUARTILE(C:C,1)
In F5 enter =QUARTILE(C:C,2)
In F6 enter =QUARTILE(C:C,3)
In F7 enter =QUARTILE(C:C,4)
2 In F9 enter =AVERAGE(C:C) and in F10 enter =STDEV(C:C)
3 Draw a box and whisker plot of the real estate data.
4 Obtain either real estate data or weekly rental data of flats from two or more different suburbs. Find 5-number summaries for each and appropriate boxplots using the given spreadsheet.
5 Write a summary of your findings from 4 (no more than 100 words). The emphasis is to be on comparing the two data sets.

INVESTIGATION 3 HOW DO YOU LIKE YOUR EGGS?

This investigation examines the weight and dimensions of eggs. Because you will need to collect the data for at least 5 dozen eggs, it is suggested that you work with at least one, and preferably three other people.
Eggs are sold in three categories: small, medium or large. Decide which category your group will use.
Using a set of electronic scales, measure the weight of at least five dozen eggs in the category of your choice. Use a set of electronic calipers to measure the length and maximum diameter of each egg. Record your results in a spreadsheet.
Use the spreadsheet to organise the data in three separate ways.
By weight:
• in classes of 0.1 g
• in classes of 0.2 g
• without classes
By length:
• in classes of 0.1 mm
• in classes of 0.2 mm
• without classes
By width:
• in classes of 0.1 mm
• in classes of 0.2 mm
• without classes
Present your results in a suitable table.
Draw side by side box and whisker plots for each set of data.
In a brief report, comment on your results. Discuss the characteristics of a typical egg in the category you have studied.
By chance, you hear a suggestion that hens of different breeds produce eggs of different sizes. Discuss how you would set about examining that conjecture.
On the basis of your work in this investigation, discuss the characteristics you would expect to find for a different category of eggs. How many eggs do you think it would take to test your conjecture?

If someone brought an egg to you and asked if it was of the category you had measured, how would you check?

INVESTIGATION 4 HEART STOPPER

A new drug that is claimed to lower the cholesterol level in humans has been developed. A heart specialist was interested to know if the claims made by the company selling the drug were accurate. He enlisted the help of 50 of his patients. They agreed to take part in an experiment in which 25 of them would be randomly allocated to take the new drug and the other 25 would take an identical looking pill that was actually a placebo (a sugar pill that would have no effect at all). All participants had their cholesterol level measured before starting the course of pills and then, at the end of two months of taking the pills, they had their cholesterol level measured again. The data collected by the doctor is given below.

Cholesterol levels of all participants before the experiment:
7.1 6.7 6.2 6.0 6.3 8.2 7.3 7.0 5.0 6.2
8.4 8.9 8.1 8.3 8.5 6.5 6.2 8.4 7.9 5.0
6.5 6.3 6.4 6.7 6.6 7.1 7.1 7.6 7.3 8.1
7.2 8.4 8.6 6.0 6.8 7.1 7.4 7.5 7.4 7.5
6.1 7.6 7.9 7.4 6.5 6.0 7.5 6.2 8.6 7.6

Cholesterol levels of the 25 participants who took the drug:
4.8 4.4 4.7 5.6 4.7 4.7 4.7 4.9 5.1 4.2
6.2 4.6 4.8 4.7 5.2 4.6 4.7 4.8 4.4 5.2
5.6 4.8 4.2 5.0 4.4

Cholesterol levels of the 25 participants who took the placebo:
7.0 5.7 8.2 8.4 8.3 7.5 8.8 7.9 6.0 6.1
6.7 7.6 6.6 7.3 6.1 7.6 6.1 6.5 7.4 7.9
8.4 6.2 6.6 6.8 6.5

What to do:
1 Produce a single stemplot for the cholesterol levels of all participants after the experiment.
Present the stemplot so that this data can be simply compared to all the measurements before the experiment began.
2 Use technology to calculate the relevant statistical data.
3 Use the data to complete the table:

Cholesterol level   Before experiment   25 participants taking the drug   25 participants taking the placebo
4.0 - < 4.5
4.5 - < 5.0
5.0 - < 5.5
5.5 - < 6.0
...
8.5 - < 9.0

4 Calculate the mean and standard deviation for each group in the table.
5 Write a report presenting your data and findings based on that data.

H NORMAL DISTRIBUTIONS

Many data sets have frequency distributions that are ‘bell-shaped’ and symmetrical about the mean.

For example, the histogram alongside exhibits this typical ‘bell-shape’. The data represents the heights of a group of adult women and has a mean of 165 cm and a standard deviation of 8 cm.

[Histogram: frequency (0 to 25) against height (cm), from 140 to 190 in steps of 5]

The data is centred about the mean and spreads from 140 cm to 190 cm. However, most of the data have values between 155 cm and 170 cm and not many have values more than 180 cm or less than 150 cm.

THE NORMAL DISTRIBUTION CURVE

On the right we have the graph of the normal distribution of scores. Notice its symmetry.

[Figure: the normal distribution curve, relative frequency on the vertical axis, with mean x̄ = median marked at the centre]

The normal distribution is a theoretical, or idealised, model of many real life distributions. In a normal distribution, data is distributed symmetrically about the mean. The mean also coincides with the median of the data.

The normal distribution lies at the heart of statistics. Many naturally occurring phenomena have a distribution that is normal, or approximately normal.
Some examples are:
• the chest sizes of Australian males
• the distribution of errors in many manufacturing processes
• the lengths of adult female tiger sharks
• the length of cilia on a cell
• scores on tests taken by a large population
• repeated measurements of the same quantity
• yields of corn, wheat, etc.

HOW THE NORMAL DISTRIBUTION ARISES

Example 1:
Consider the oranges stripped from an orange tree. They do not all have the same weight. This variation may be due to several factors which could include:
• different genetic factors
• different times when the flowers were fertilised
• different amounts of sunlight reaching the leaves and fruit
• different weather conditions (some may be affected by the prevailing winds more than others), etc.

The result is that much of the fruit could have weights centred about, for example, a mean weight of 214 grams, and there are far fewer oranges that are much heavier or lighter. Invariably, a bell-shaped distribution of weights would be observed and the normal distribution model fits the data fairly closely.

Example 2:
In manufacturing nails of a given length, say 50 mm, the machines produce nails of average length 50 mm but there is minor variation due to random errors in the manufacturing process. A small standard deviation of 0.3 mm, say, may be observed, but once again a bell-shaped distribution models the situation.

Once a normal model has been established we can use it to make predictions about the distribution and to answer other relevant questions.

A TYPICAL NORMAL DISTRIBUTION

A large sample of cockle shells was collected and the maximum distance across each shell was measured. Click on the video clip icon to see how a histogram of the data is built up.
Now click on the demo icon to observe the effect of changing the class interval lengths for normally distributed data.

PROPERTIES OF NORMAL DISTRIBUTIONS

INVESTIGATION 5 THE NORMAL CURVE’S PROPERTIES

Click on the icon to obtain a sample from a normal distribution.

What to do:
1 Find for n = 300, the sample’s mean (x̄), median and standard deviation (s).
2 Find the proportion of the sample values which lie in the intervals x̄ ± s, x̄ ± 2s, x̄ ± 3s.
3 Select another random sample for n = 200 and repeat 2.
4 Repeat again, each time recording your results.
5 Increase n to 1000 and obtain more data for proportions in the intervals described in 2. Repeat several times.
6 Write a brief report of your findings.

From the previous investigation you should have discovered that the normal distribution has certain properties or characteristics that enable valid statistical inferences to be made. Some of the properties are listed below.

For the Normal distribution it can be shown that:
• 68% of the data will have values within one standard deviation of the mean.
• 95% of the data will have values within two standard deviations of the mean.
• 99.7% of the data will have values within three standard deviations of the mean.

Graphically this can be summarised:

[Figure: three normal curves showing 68% of data between x̄ − s and x̄ + s, 95% of data between x̄ − 2s and x̄ + 2s, and 99.7% of data between x̄ − 3s and x̄ + 3s]

These properties are illustrated on the normal distribution below:

[Figure: normal curve divided at x̄ − 3s, x̄ − 2s, x̄ − s, x̄, x̄ + s, x̄ + 2s and x̄ + 3s. From left to right the regions contain 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35% and 0.15% of the data, with 50% on each side of the mean. The bands x̄ ± s, x̄ ± 2s and x̄ ± 3s are labelled 68%, 95% and 99.7%.]

Example 15
A company sells radios with a mean life of 18 months and a standard deviation of 3 months. The company will replace a radio if it is faulty within 12 months of sale.
If they sell 5000 radios, how many can they expect to replace if life expectancy is normally distributed?

We draw a rough picture of the normal distribution curve:

[Figure: normal curve with mean 18 months and marks 3 months apart at 12, 15, 18, 21 and 24. 2.5% of the area lies below 12, 13.5% between 12 and 15, and 34% between 15 and 18.]

Thus 2.5% are expected to fail within 12 months, and
2.5% of 5000 = 0.025 × 5000 = 125
So about 125 radios can be expected to be replaced.

Example 16
The chest measurements of 18 year old male footballers are normally distributed with a mean of 95 cm and a standard deviation of 8 cm.
a Find the percentage of footballers with chest measurements between:
  i 87 cm and 103 cm
  ii 103 cm and 111 cm
b Find the probability that the measurement of a randomly chosen footballer is:
  i more than 119 cm
  ii less than 87 cm
c What chest measurement would put a footballer in the largest 2.5% of 18 year olds?

We draw a rough sketch of the normal distribution curve and label with percentages:

[Figure: normal curve with marks at 71, 79, 87, 95, 103, 111 and 119 cm. From left to right the regions contain 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35% and 0.15%.]

Let X cm be the chest measurement of a footballer.
a i  Pr(87 < X < 103) = 34% + 34% = 68%
  ii Pr(103 < X < 111) = 13.5%
b i  Pr(X > 119) = 0.15% = 0.0015
  ii Pr(X < 87) = 13.5% + 2.35% + 0.15% = 16% = 0.16
c To be in the largest 2.5% of chest measurements, a footballer would need to have a chest of at least 111 cm.

EXERCISE 8H.1

1 The following data are the heights, to the nearest centimetre, of thirty footballers that belong to an AFL club.
192 185 189 183 189 191 190 192 198 187
191 194 198 181 189 191 190 187 189 194
198 191 187 196 181 193 187 196 192 178
a Find the i mean, x̄ ii standard deviation, s of the height of the footballers in this club.
b i Calculate the interval [x̄ − s, x̄ + s].
  ii What percentage of the heights would be expected to fall in this interval?
  iii What percentage of the actual heights fall in this interval?
c What percentage of the actual heights fall in the interval [x̄ − 2s, x̄ + 2s]? What percentage would you expect to fall in this interval?

2 The distribution of weights of 600 g loaves of bread is bell-shaped with a mean weight of 605 g and a standard deviation of 8 g. What percentage of the loaves can be expected to have a weight between 597 g and 613 g? (Use the Normal distribution as a model.)

3 A restaurateur found that the average time spent by diners was 2 hours, with a standard deviation of 30 minutes. Assuming that the time spent by diners is normally distributed, and that there are 200 diners each week, calculate:
a the number of diners who stay between 2 and 3 hours
b the number who stay longer than 3 hours
c the number who stay less than 1½ hours.

4 A clock manufacturer did a survey of 800 of its clocks to find out how accurate they were. They found that the mean error was 6 minutes slow with a standard deviation of 2 minutes. Assuming that the error in time is normally distributed, find the expected number of clocks that are:
a within 4 minutes of the mean error
b between 4 and 8 minutes slow
c more than 10 minutes slower than the correct time.

5 A bottle filling machine fills a mean of 20 000 bottles a day with a standard deviation of 2000. If we assume that production is normally distributed and the year comprises 260 working days, calculate to the nearest whole day, the number of working days that:
a over 20 000 bottles are filled
b over 16 000 bottles are filled
c between 18 000 and 24 000 bottles are filled.

6 A battery manufacturer finds that its batteries have a mean life of 28 months with a standard deviation of 4 months.
If the battery lives are normally distributed, and the company manufactures 40 000 batteries per annum, calculate:
a the number of batteries that will last longer than 3 years.
b If the company guarantees their batteries for 2 years, what number could they expect to replace under the guarantee?
c If the company wanted to limit the claims under the guarantee to no more than 2.5% of production, what guarantee period would they need to put on their batteries?

7 The distribution of exam scores for 780 students who sat an exam is Normal with a mean of 55 and a standard deviation of 15.
a Find the number of students who would be expected to obtain a score:
  i greater than 70
  ii less than 55
  iii less than 25
  iv between 70 and 85
b The bottom 16% of students will be given a ‘fail’. What is the cut-off mark for a ‘fail’?

NORMAL DISTRIBUTION PROBABILITIES USING A CALCULATOR

The previous questions were based around the standard 68% : 95% : 99.7% proportions. We can find other probabilities for the normal distribution using a graphics calculator.

Suppose X is normally distributed with mean 10 and standard deviation 2. How do we find Pr(8 ≤ X ≤ 11)?

[Figure: normal curve with mean 10 and the region between 8 and 11 shaded]

For TI-83
Press 2nd VARS (DISTR) to bring up the DISTR menu and then 2 to select 2:normalcdf(
The syntax for this command is normalcdf(lower bound, upper bound, x̄, s)
Enter 8 , 11 , 10 , 2 ) ENTER

For Casio
From the Main Menu, select STAT.
Press F5 (DIST) F1 (NORM) F2 (Ncd)
Enter 8 for the lower bound, 11 for the upper bound, 2 for the standard deviation and 10 for the mean. Then select Execute.

Thus the probability of X being between 8 and 11 is 0.533 (3 s.f.).
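The same probability can be checked without a graphics calculator. The sketch below is only an illustration, assuming Python's standard `statistics.NormalDist` class stands in for the calculator's normalcdf command. The helper name `normal_probability` is hypothetical and is not part of any calculator or package named in this chapter. It also recomputes the 68% : 95% : 99.7% proportions.

```python
from statistics import NormalDist


def normal_probability(lower, upper, mean, sd):
    """Return Pr(lower <= X <= upper) for a normal X with the given mean and sd.

    Hypothetical helper: it mirrors the argument order of the calculator
    syntax normalcdf(lower bound, upper bound, mean, standard deviation).
    """
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)


# Proportions within 1, 2 and 3 standard deviations of the mean.
# These are roughly 68%, 95% and 99.7% for any normal distribution.
for k in (1, 2, 3):
    share = normal_probability(-k, k, 0, 1)
    print(f"within {k} standard deviation(s): {share:.4f}")

# Pr(8 <= X <= 11) for mean 10 and standard deviation 2.
p = normal_probability(8, 11, 10, 2)
print(f"Pr(8 <= X <= 11) = {p:.3f}")  # approximately 0.533
```

Because the normal distribution is continuous, Pr(8 ≤ X ≤ 11) and Pr(8 < X < 11) are equal, so the sketch does not need to treat the endpoints separately.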
FINDING QUANTILES (k-VALUES) USING A CALCULATOR

Suppose we want to find k such that Pr(X ≤ k) = 0.8 for the normal distribution alongside with mean 10 and standard deviation 2.

[Figure: normal curve with mean 10 and 80% of the area shaded to the left of k]

For TI-83
k can be found using invNorm(0.8, 10, 2), where the arguments are the probability, the mean and the standard deviation.
Press 2nd VARS (DISTR) 3 to get invNorm(
Enter 0.8 , 10 , 2 ) ENTER
k = invNorm(0.8, 10, 2) ≈ 11.7 (3 s.f.)

For Casio
From the Main Menu, select STAT.
Press F5 (DIST) F1 (NORM) F3 (InvN)
Enter 0.8 for the area, 2 for the standard deviation and 10 for the mean. Then select Execute.
So, k ≈ 11.7 (3 s.f.)

Example 17
The length of King George Whiting caught in SA is normally distributed with mean 38 cm and standard deviation 3.5 cm. What percentage of whiting caught would be expected to have a length of:
a more than 40 cm
b less than 33 cm
c between 35 and 45 cm?
d If the fisheries department wants to protect the smallest 30% of whiting from fishing, what size limit should be set?

a Let X cm be the length of a whiting.
Pr(X > 40) = normalcdf(40, E99, 38, 3.5) ≈ 0.284 i.e., 28.4%
E99 is the largest number able to be entered into the calculator.
b Pr(X < 33) = normalcdf(−E99, 33, 38, 3.5) ≈ 0.0766 i.e., 7.66%
c Pr(35 < X < 45) = normalcdf(35, 45, 38, 3.5) ≈ 0.782 i.e., 78.2%
d Size limit = invNorm(0.3, 38, 3.5) ≈ 36.2
i.e., fish should be greater than 36.2 cm

EXERCISE 8H.2

1 The lengths of metal bolts produced by a machine are found to be normally distributed with a mean of 19.8 cm and a standard deviation of 0.3 cm. Find the probability that a bolt selected at random from the machine will be between 19.7 cm and 20 cm.

2 A student hoping to pass an examination is told that the mean mark of the class was 63 with standard deviation 17. If 20% of the class failed, what minimum mark must he have achieved, assuming the scores were normally distributed?

3 The IQ of secondary school students from a particular area is believed to be normally distributed with a mean of 103 and a standard deviation of 15.1. A student from one of the schools is randomly selected. Find the probability that this student will have an IQ:
a of at least 115
b that is less than 75
c between 95 and 105.

4 A machine fills bags with icing sugar. The mean net weight per bag is 2 kg and the standard deviation is 0.1 kg. If 5% of the bags are rejected for being too heavy, and 10% of the bags are rejected for being too light, in what range must the weight of a bag lie for it to be acceptable, assuming the weight per bag is normally distributed?

5 The heights of men at an army barracks are found to be normally distributed with a mean of 181 cm and a standard deviation of 4 cm. A man is selected at random from this population. Find the probability that this person is:
a at least 175 cm tall
b between 177 cm and 180 cm tall.
6 The average score for a Physics test was found to be 46 and the standard deviation of the scores was 25. Assuming that the scores were normally distributed, the teacher decided to award an A to the top 7% of the students in the class. What is the lowest score that a student must obtain in order to achieve an A?

7 The average weekly earnings of the students at Hardtime High School are found to be approximately normally distributed with a mean of $40 and a standard deviation of $6.
a What proportion of students would you expect to earn:
  i between $30.00 and $50.00 per week
  ii less than $50.00 per week?
b A student is classified as ‘rich’ if they are in the top 10% of weekly earners. What weekly earnings are required to be classified as ‘rich’?
c Discuss the reasonableness of a student earning more than $60 per week.
d The average weekly earnings of the students at Comfy College have a mean of $25 and a standard deviation of $4. Sketch graphs on the same axes to show the earnings at the two schools.
e What assumption was made about the earnings of students at Comfy College when drawing the graph in d? Is this reasonable?
f If a ‘rich’ student from Hardtime High School transferred to Comfy College, how would the mean and standard deviation of earnings at Comfy College be affected?

8 The lengths of Murray cod caught in the River Murray are found to be normally distributed with a mean of 41 cm and a standard deviation of 3.3 cm.
a Find the probability that a randomly selected cod is at least 50 cm long.
b What proportion of cod measure between 40 cm and 50 cm?
c In a sample of 200, how many of them would you expect to be less than 45 cm?
The lengths of callop caught in the River Murray are also normally distributed with a mean of 32 cm and standard deviation 2.5 cm.
d Jayco catches a Murray cod measuring 48 cm and a callop measuring 39 cm. Which of the two fish was the bigger catch, relative to its own species?
e Liam claimed he caught a callop measuring 50 cm. Is this statistically reasonable?

9 Sam’s Maths mark is 83 in a class where the mean mark is 87 and the standard deviation is 4.1. The same group of students are in a Chemistry class where Sam’s mark is 58. The mean mark in Chemistry is 53 and the standard deviation is 7.3. In which subject did Sam perform better relative to the other members of the class?

10 The mean time on a netball court for each member of the team is 24 minutes when played outdoors and 27 minutes when played indoors. The standard deviations are 7.3 minutes and 8.4 minutes respectively. If Jan’s time on the court was 23 minutes outdoors and 25 minutes indoors, in which environment did she receive the most relative court time?

I CORRELATION

INTRODUCTION

Often we wish to know how two variables are associated or related. To find such a relationship we construct and observe a scatterplot.

A scatterplot consists of points plotted on a set of axes where the independent variable is placed on the horizontal axis and the dependent variable on the vertical axis.

A typical scatterplot could look like one of the following:
• for the swimming team where weight is dependent on height.
  [Scatterplot: weight (kg) against height (cm)]
• for profitability of a sports goods store where the profit is often dependent on the amount of advertising done.
  [Scatterplot: profit ($) against advertising ($)]
• for the intelligence quotient (IQ) of an individual.
  A sociologist may be considering if a person’s IQ is dependent on their weight.
  [Scatterplot: IQ against weight]

OPENING PROBLEM 2

The relationship between weight and height of members of an AFL football team is being investigated. We expect there to be a fairly strong association between these variables as it is generally perceived that the taller a person is, the more they will weigh. The height and weight of each of the players in the team is recorded and these values form a coordinate pair for each of the players:

Player        1    2    3    4    5    6
Height (cm)  203  189  193  187  186  197
Weight (kg)  106   93   95   86   85   92

Player        7    8    9   10   11   12
Height (cm)  180  186  188  181  179  191
Weight (kg)   78   84   93   84   86   92

Player       13   14   15   16   17   18
Height (cm)  178  178  186  190  189  193
Weight (kg)   80   77   90   86   95   89

The scatterplot for the data is given alongside. Height is the independent variable (horizontal axis) and weight is the dependent variable (vertical axis).

[Scatterplot: Weight versus Height, with weight (kg) from 80 to 105 on the vertical axis and height (cm) from 175 to 205 on the horizontal axis]

Consider and possibly discuss:
• What are the variables in this problem and are they categorical or numerical?
• What is the dependent variable?
• Can you describe the appearance of the scatterplot? Are the points close to being linear?
• Does an increase in the independent variable generally cause an increase (or a decrease) in the dependent variable?

MATHEMATICAL MODELLING

A mathematical model consists of an equation which connects two or more variables. This equation may be exact or approximate, depending on the circumstances in which it arises.

For example, assuming no abnormalities or loss due to accident, the total number of fingers f, for x people is given by f = 10x, clearly an exact rule.
However, w ≈ 0.9h − 81 is a very approximate model for determining the weight w kg of AFL footballers of height h cm. This model is obtained by trying to fit a ‘line of best fit’ through the scatterplot points.

Questions which could be asked where mathematical modelling may be used could be similar to those following:
• Can tomorrow’s temperature be reasonably accurately predicted using today’s temperatures from country centres west of us?
• Can a company predict its increase in sales due to increased spending on advertising?
• Can a student’s success in a tertiary institution be predicted from his or her Year 12 final results?
• Can increasing a person’s intake of vitamins, particularly vitamin C, reduce their susceptibility to colds and influenza?
• Is there a relationship between a person’s age and their systolic blood pressure?

In this section we will be concerned with trying to fit mathematical models to data obtained by observation or experiment. In particular, we will examine for variables x and y:
linear models having form y = ax + b {a, b are constants}

CORRELATION

Correlation refers to the relationship or association between two variables.

In looking at the correlation between two variables we should follow these steps:

Step 1: Look at the scatterplot for any pattern.
For a generally upward shape we say that the correlation is positive, and in this case an increase in the independent variable means that the dependent variable generally increases.
For a generally downward shape we say that the correlation is negative, and in this case an increase in the independent variable means that the dependent variable generally decreases.
For randomly scattered points (with no upward or downward trend) there is usually no correlation.

Step 2: Look at the spread of points to make a judgement about the strength of the correlation.
For positive relationships we would classify the following scatterplots as:
[Scatterplots: strong, moderate, weak]
Similarly there are strength classifications for negative relationships:
[Scatterplots: strong, moderate, weak]

Step 3: Look at the pattern of points to see whether or not it is linear.
[Scatterplots: "These points appear to be roughly linear." and "These points do not appear to be linear."]

Step 4: Look for and investigate any outliers. These appear as isolated points away from the main body of data. Outliers should be investigated as sometimes they are mistakes made in recording the data or plotting it. Genuine extraordinary data should be included.
[Scatterplot: one isolated point labelled "outlier" and one point near the main body of data labelled "not an outlier"]

POSITIVE CORRELATION

An association between two variables is described as a positive correlation if an increase in one variable results in an increase in the other in an approximately linear manner.

The strength of the association is best measured with the correlation coefficient (r), which ranges between 0 and 1 for positive correlation.

An r value of 0 suggests that there is no linear association present (or no correlation). An r value of 1 suggests that there is a perfect linear association present (or perfect positive correlation).

Only deterministic models will result in perfect correlation. An example is the association between the number of sides n of a polygon and its interior angle sum S, where S = (n − 2) × 180°.

The correlation between the height and the weight of people is positive and lies between 0 and +1.
It is not an example of perfect positive correlation because, for example, not all short people are of light weight. However, taller people are generally heavier than shorter people.

The r values in between 0 and 1 represent varying degrees of linearity.

Scatter diagrams for positive correlation:
The scales on each of the four graphs are the same.
[Scatterplots of y against x: r = +1, r = +0.8, r = +0.5, r = +0.2]

NEGATIVE CORRELATION

An association between two variables is described as a negative correlation if an increase in one variable results in a decrease in the other in an approximately linear manner.

The strength of the association is best measured with the correlation coefficient (r), which ranges between 0 and −1 for negative correlation.

An r value of −1 suggests that there is a perfect linear association present (or perfect negative correlation).

Scatter diagrams for negative correlation:
[Scatterplots of y against x: r = −1, r = −0.8, r = −0.5, r = −0.2]

We can interpret the Weight versus Height scatterplot from earlier as follows:
“There is a moderate positive association between the variables height and weight. This means that as height increases, weight increases. The relationship appears linear and there are no obvious outliers.”

[Scatterplot: Weight versus Height, with weight (kg) from 80 to 105 on the vertical axis and height (cm) from 175 to 205 on the horizontal axis]

CAUSATION

Correlation between two variables does not necessarily mean that one variable causes the other. Consider the following:

1 The arm length and running speed of a sample of young children were measured and a strong, positive correlation was found to exist between the variables. Does this mean that short arms cause a reduction in running speed or that a high running speed causes your arms to grow long?
These are obviously nonsense assumptions and the strong positive correlation between the variables is attributed to the fact that both arm length and running speed are closely related to a third variable, age. Arm length increases with age as does running speed (up to a certain age).

2 The number of television sets sold in Ballarat and the number of stray dogs collected in Bendigo were recorded over several years and a strong positive association was found between the variables. Obviously the number of television sets sold in Ballarat was not influencing the number of stray dogs collected in Bendigo. Both variables have simply been increasing over the period of time that their numbers were recorded.

If a change in one variable causes a change in the other variable then we say that a causal relationship exists between them.

For example: The age and height of a group of children is measured and there is a strong positive correlation between these variables. This will be a causal relationship because an increase in age will cause an increase in height.

In cases where this is not apparent, there is no justification, based on high correlation alone, to conclude that changes in one variable cause the changes in the other.

EXERCISE 8I.1

1 For each of the following, state whether you would expect to find positive, negative, or no association between the following variables. Indicate the strength (none, weak, moderate or strong) of the association.
a Shoe size and height.
b Speed and time taken for a journey.
c The number of occupants in a household and the water consumption of the household.
d Maximum daily temperature and the number of newspapers sold.
e Age and hearing ability.
2 Copy and complete the following:
a If the variables x and y are positively associated then as x increases, y ..........
b If there is negative association between the variables m and n then as m increases, n ..........
c If there is no association between two variables then the points on the scatterplot appear to be .......... ..........
3 Describe, briefly, exactly what is meant by:
a a scatterplot
b a mathematical model
c correlation
d positive correlation
e negative correlation
f an outlier
4
a What is meant by the independent and dependent variables?
b When graphing, which variable is placed on the horizontal axis?
5 For the following scatterplots comment on:
i the existence of any pattern (positive, negative or no association)
ii the relationship strength (zero, weak, moderate or strong)
iii whether the relationship is linear or not
iv whether or not there are any outliers.
[Figure: six scatterplots of y against x, labelled a to f]
6 The following pairs of variables were measured and a strong positive correlation between them was found. Discuss whether a causal relationship exists between the variables. If not, suggest a third variable to which they may both be related.
a The lengths of one’s left and right feet.
b The damage caused by a fire and the number of firemen who attend it.
c Company expenditure on advertising, and sales.
d The height of parents and the height of their adult children.
e The number of hotels and the number of churches in rural towns.

MEASURING CORRELATION
When dealing with linear association we can use the concept known as correlation. Correlation is a technique that was devised to measure the strength and direction of the linear association between two variables.
The correlation between two numerical variables can be measured by a correlation coefficient. There are several correlation coefficients that can be used, but the most widely used is Pearson’s correlation coefficient, named after the statistician Karl Pearson who developed it. Its full name is Pearson’s product-moment correlation coefficient, and it is denoted r. The correlation coefficient (r) lies between −1 and 1.

Constructing a scatterplot and finding Pearson’s correlation coefficient
We consider finding Pearson’s correlation coefficient for the following data:
x   1   2   3   4   5   6   7   8   9   10
y   2   1   4   3   5   6   5   5   7   8

Using a Texas Instruments TI-83
First we activate the diagnostic tools. Once turned on these will remain on, but if the memory is cleared or the battery changed then the calculator will revert to the default settings, which do not display r.
To activate the diagnostic tools: Locate the CATALOG menu using 2nd 0 . Use the arrow keys to scroll down to DiagnosticOn and press ENTER . DiagnosticOn will appear on the screen. Press ENTER again and you will have turned the diagnostic tools on.
Enter the data into lists, the x-data into L1 and the y-data into L2.
Press 2nd Y= (STAT PLOT), then press ENTER . Turn Plot1 On and select the scatterplot icon. The XList is for the independent variable L1 and the YList is for the dependent variable L2.
Press ZOOM 9 (9:ZoomStat) to view the scatterplot. You can press TRACE and use the arrow keys to identify the points.
We check the scatterplot at this stage as it will reveal any errors made in entering the data, and any outliers. It will also indicate whether the relationship appears linear.
Press STAT 4 to select 4:LinReg(ax+b) from the STAT CALC menu.
(This means we are fitting a linear model, or linear regression, of the form y = ax + b to the data. Regression will be discussed in greater detail soon!)
LinReg(ax+b) appears on the screen. You need to tell the calculator where your data is: enter L1, L2 by pressing 2nd 1 (L1) , 2nd 2 (L2) ENTER .
The linear regression screen appears and the last figure, r = 0.9130..., is Pearson’s correlation coefficient for this data set. The r value indicates a strong positive correlation, which agrees with the scatterplot.

Using a Casio fx-9860g
Enter the data into lists, the x-data into List 1 and the y-data into List 2.
Press F6 until the GRPH icon is in the bottom left corner of the screen.
Press F1 (GRPH) F6 (SET), then press F1 (Scat) to select the scatterplot. Make sure the XList is set to List1 and the YList is set to List2, then press EXIT F4 (SEL) F1 (On) to turn StatGraph1 on.
Press F6 (Draw) to draw the scatterplot.
Press F1 (X) to obtain a linear regression of the data.
The linear regression screen appears and the figure r = 0.9130... is Pearson’s correlation coefficient for this data set. The r value indicates a strong positive correlation, which agrees with the scatterplot.

Notes about Pearson’s correlation coefficient:
• It is designed for linear data only.
• It should be used with caution if there are outliers. For example, the data in the two scatterplots below both have a correlation coefficient of r = 0.8. The presence of the outlier in the second graph has greatly reduced the r value; without this point, r would equal 1.
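The calculator result above can also be checked by hand or in a few lines of code. The following is a minimal sketch, assuming the usual product-moment definition of r (the sum of products of deviations divided by the square root of the product of the sums of squared deviations); the function name pearson_r is chosen here for illustration and is not part of the original text. It uses the ten (x, y) pairs from the worked example and, as a second check, the lawn beetle data of Example 19.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Data from the worked example
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 5, 6, 5, 5, 7, 8]
r_example = pearson_r(x, y)
print(f"{r_example:.4f}")   # 0.9130

# Data from Example 19 (chemical in grams, surviving lawn beetles)
chemical = [2, 5, 6, 3, 9]
beetles = [11, 6, 4, 6, 3]
r_beetles = pearson_r(chemical, beetles)
print(f"{r_beetles:.3f}")   # -0.859
```

The sign of the result gives the direction of the association and its size gives the strength, which is why the scatterplot should still be checked for linearity and outliers before r is quoted.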
[Figure: two scatterplots of y (0 to 15) against x (0 to 14), each with r = 0.8; the second contains a marked outlier]

Example 18
In attempting to find if there is any association between average speed in the metropolitan area and age of drivers, a device was fitted to cars of drivers of different ages. The results are shown in the scatterplot. The r value for this association is +0.027. Describe the association.
[Figure: scatterplot of average speed (50 to 70) against age (20 to 90)]
As r is close to zero, there is no correlation between the two variables.

Example 19
Wydox have been trying out a new chemical to control the number of lawn beetles in the soil. Determine the extent of the correlation between the quantity of chemical used and the number of surviving lawn beetles per square metre of lawn.
Lawn                               A    B    C    D    E
Amount of chemical (g)             2    5    6    3    9
Number of surviving lawn beetles   11   6    4    6    3
We construct a scatterplot:
[Figure: scatterplot of no. of lawn beetles against chemical]
We now fit a linear model to the data.
From the scatterplot and r ≈ −0.859, we have a moderate negative association between the amount of chemical used and the number of lawn beetles surviving. Generally, the more chemical used, the fewer beetles survive.

EXERCISE 8I.2
1 Mr Whippy thought that there may be a relationship between the temperature and the number of ice-creams he sells. He collected the following data:
Max. daily temp. (°C)    29   40   35   30   34   34   27   27   19   37   22   19   25   36   23
No. of ice-creams sold   119  164  131  152  206  169  122  143  63   208  155  96   125  248  139
a Use your calculator to sketch a scatterplot and calculate Pearson’s correlation coefficient.
b Interpret the value of r in terms of strength and direction.
c Does the value of the correlation coefficient confirm your observations from the scatterplot?
Was it appropriate to find r for this data? Explain.
2 A class of 25 students was asked to record their times (in minutes) spent preparing for a test. The data below was collected:
[Table: Min. spent and Score for the 25 students; values garbled in extraction]
a Use your calculator to sketch a scatterplot and calculate Pearson’s correlation coefficient.
b Interpret the value of r in terms of strength and direction.
c Does the value of the correlation coefficient confirm your observations from the scatterplot? Was it appropriate to find r for this data? Explain.
3 Which one of the following is true for Pearson’s correlation coefficient r?
A The addition of an outlier to a set of data would always result in a lesser value of r.
B An r value of 1 represents a stronger relationship between the variables than an r value of −1.
C A high value of r means that one variable is causing the other variable to change.
D An r value of −0.8 means that as the independent variable increases, the dependent variable will tend to decrease.
E It can take values between 0 and 1 inclusive.

THE COEFFICIENT OF DETERMINATION (r²)
To help describe the strength of association we calculate the coefficient of determination (r²). This is simply the square of the correlation coefficient (r) and as such the direction of association is eliminated.
Many texts vary on the advice they give. We suggest the following rule of thumb when describing the strength of linear association:
value                strength of association
r² = 0               no correlation
0 < r² < 0.25        very weak correlation
0.25 ≤ r² < 0.50     weak correlation
0.50 ≤ r² < 0.75     moderate correlation
0.75 ≤ r² < 0.90     strong correlation
0.90 ≤ r² < 1        very strong correlation
r² = 1               perfect correlation

CALCULATION OF THE COEFFICIENT OF DETERMINATION
r² is found on the linear regression screen of your calculator. Alternatively, if the value of r is known, then this can simply be squared.

INTERPRETATION OF THE COEFFICIENT OF DETERMINATION
r² indicates the strength of association between the dependent variable and the independent variable.
If there is a causal relationship then r² indicates the degree to which change in the independent variable explains change in the dependent variable.
For example: An investigation into many different brands of muesli found that there is strong positive correlation between the variables fat content and kilojoule content. Pearson’s correlation coefficient, r, was found to be 0.8625. The coefficient of determination for this study is (0.8625)² ≈ 0.744.
An interpretation of this r² value is “the proportion of variation in kilojoule content that can be explained by the variation in fat content of muesli is 0.744.”
It is usual to quote the coefficient of determination as a percentage. A proportion of 0.744 is equivalent to 0.744 × 100 = 74.4%. The interpretation becomes:
74.4% of the variation in kilojoule content of muesli (the dependent variable) can be explained by the variation in fat content of muesli (the independent variable).
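The squaring step and the rule-of-thumb table above can be sketched in code as follows. This is only an illustration: the function name describe_strength is hypothetical, and the band boundaries are taken directly from the table in this section rather than from any universal standard. The r values used are those quoted in the text (0.8625 for the muesli study, 0.9130 for the worked calculator example, and −0.859 for Example 19).

```python
def describe_strength(r):
    """Return (r squared, rule-of-thumb description) for a correlation coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    r2 = r * r  # squaring removes the direction of the association
    if r2 == 0:
        return r2, "no correlation"
    # Upper bounds of each band, as given in the rule-of-thumb table
    bands = [
        (0.25, "very weak correlation"),
        (0.50, "weak correlation"),
        (0.75, "moderate correlation"),
        (0.90, "strong correlation"),
        (1.00, "very strong correlation"),
    ]
    for upper, label in bands:
        if r2 < upper:
            return r2, label
    return r2, "perfect correlation"

# Muesli study: r = 0.8625, quoted as a percentage
muesli_r2, _ = describe_strength(0.8625)
print(f"{muesli_r2:.1%}")          # 74.4%

# Worked calculator example and Example 19
print(describe_strength(0.9130)[1])    # strong correlation
print(describe_strength(-0.859)[1])    # moderate correlation
```

Because r is squared, the description carries no sign, so the direction (positive or negative) still has to be stated separately from the scatterplot or from r itself.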
If 74.4% of the variation in kilojoule content of muesli can be explained by the fat content of muesli then we can assume that the other 100% − 74.4% = 25.6% of the variation in kilojoule content of muesli can be explained by other factors (which may or may not be known).

Example 20
A study has found that 45% of the variation in selling price can be explained by the variation in age of a used car. If this statement was based on the coefficient of determination then what would be the value of Pearson’s correlation coefficient for this study?
We are told that r² = 0.45, so r is a square root of 0.45 (√0.45 ≈ 0.6708) and only its sign remains to be decided.
At this point we need to consider the variables involved: selling price and age of a car. We would assume that as the age of a car increases then the selling price of a car would decrease, i.e., there is negative correlation between the variables.
Hence we can conclude for this study that Pearson’s correlation coefficient, r, will be −0.6708.

EXERCISE 8I.3
1 The scatterplot alongside shows the association between the number of car crashes in which a casualty occurred and the total number of car crashes in South Australia in each year from 1985 to 2007. Given that the r value is 0.49:
a find r²
b describe the association between these variables.
[Figure: ‘Casualty crashes’ v ‘All crashes’ scatterplot; casualty crashes (6000 to 10 000) against all crashes (30 000 to 55 000)]
2 In an investigation to examine the association between the tread depth (y mm) and the number of kilometres travelled (x thousand), a sample of 8 tyres of the same brand was taken and the results are given below.
kilometres (x thousand)   14    17    24    34    35    37    38    39
tread depth (y mm)        5.7   6.5   4.0   3.0   1.9   2.7   1.9   2.3
[Figure: tyre cross-section showing the depth of tread]
a Draw a scatterplot of the data.
b Calculate r and r² for the tabled data.
c Describe the association between tread depth and the number of kilometres travelled for this brand of tyre.
3 In an investigation the coefficient of determination for the variables preparation time and exam score is found to be 0.5624. Complete the following interpretation of the coefficient of determination:
...... % of the variation in .......... can be explained by the .......... in preparation time.
4 For each of the following find the value of the coefficient of determination correct to four decimal places, and interpret it in terms of the variables.
a An investigation has found the association between the variables time spent gambling and money lost has an r value of 0.4732.
b For a group of children a product-moment correlation coefficient of −0.365 is found between the variables heart rate and age.
c In a study of a sample of countries, Pearson’s correlation coefficient for the variables female literacy and gross domestic product is found to be 0.7723.
5 A rural school has investigated the relationship between the time spent travelling to school (minutes) and a student’s year ten average (%) for a sample of students. The results are given in the table below:
Travel time (mins)    10  33  18  43  34  30  24  47  44  41  17  45  39  31  23  11  14  25  16  17
Year 10 average (%)   51  78  97  56  90  70  64  67  37  46  95  67  31  57  43  99  98  82  40  67
a Construct a scatterplot of the data and interpret the scatterplot.
b Find Pearson’s correlation coefficient for the data and interpret.
c Calculate the coefficient of determination and interpret this in terms of the variables.

J LINEAR REGRESSION
Regression is a word that means fitting a line or curve to a set of data.
A curve that is fitted to a set of paired numerical data gives us an algebraic relationship between the variables that can be used to predict values of one variable given values for the other.
In this course, we only consider linear regression. Linear regression is fitting a line to the set of data. If there is a linear relationship between the variables, the regression line accurately models the relationship between them.
Let us revisit Opening Problem 2. We know that there is quite a strong positive correlation between the height and the weight of the players. Consequently, we should be able to find a linear equation which ‘best fits’ the data. This line of best fit could be found by eye. However, different people will use different lines. So, how do we find, mathematically, the line of best fit?
[Figure: Weight versus Height scatterplot; weight (kg) from 80 to 105 against height (cm) from 175 to 205]

LEAST SQUARES REGRESSION
Statisticians invented a method where the best line results.
The least squares regression line is a line drawn so that the sum of the squares of the vertical distances from each point on the scatterplot to the line (the dotted lines) is a minimum.
[Figure: four data points (x₁, y₁), (x₂, y₂), (x₃, y₃), (x₄, y₄) and the line y = mx + c with y-intercept c, with dotted vertical segments joining each point to the line]
The least squares regression line has form y = ax + b, where
x is the variable on the horizontal axis
y is the variable on the vertical axis
a is the slope or gradient of the line
b is the y-intercept of the line.

FINDING AND PLOTTING THE LEAST SQUARES REGRESSION LINE
Consider the following data:
x   55   20   27   33   73   18   37   51   79
y   72   37   53   74   73   44   59   55   84

For TI-83
Enter the data into lists L1 and L2 and check its scatterplot.
Press STAT 4 to select 4:LinReg(ax+b) from the STAT CALC menu.
Enter L1 and L2 by pressing 2nd 1 (L1) , 2nd 2 (L2) , and then press VARS 1 to select 1:Function from the Y-VARS menu, then 1 to select 1:Y1 from the FUNCTION menu.
Press ENTER to view the equation of the regression line. The gradient ‘a’ and the y-intercept ‘b’ of the least squares regression line are given. The equation of the regression line is y = 0.572x + 36.2.
Reminder: If the values for r² and r are not shown, press 2nd 0 (CATALOG), choose DiagnosticOn, and press ENTER .
Now press GRAPH to display the regression line on the scatterplot.
Note: The equation of the regression line has been pasted into Y1 and can now be used to make predictions if appropriate.

For Casio
Enter the x-data into List 1 and the y-data into List 2, and view its scatterplot.
Press F1 (X) to view the regression line. The gradient ‘a’ and the y-intercept ‘b’ of the least squares regression line are given. The equation of the regression line is y = 0.572x + 36.2.
Press F6 (DRAW) to display the regression line on the scatterplot.
Note: Pressing F1 (X) F5 (COPY) EXE will paste the equation of the regression line into Y1, where it can be used to make predictions if appropriate.

INTERPRETING THE SLOPE AND INTERCEPT OF A REGRESSION LINE
The data below gives the fat content (grams) and the energy (kilojoules) of 17 different foods.
Fat (g)       15    55    18    45    17    24    30    30
Energy (kJ)   1255  3555  1800  1880  1670  2520  2300  2300

Fat (g)       16    11    9     30    24    24    30    32    30
Energy (kJ)   1340  1130  1150  2300  1670  1670  2510  1460  2090

A scatterplot of the data is shown alongside. The data shows moderate positive correlation between the variables energy and fat content, and there appears to be a linear relationship between the variables.
[Figure: scatterplot of energy (kJ), 0 to 4000, against fat (g), 0 to 60]
The regression equation is y = 41.9x + 834 (3 s.f.), i.e., energy = 41.9 × fat + 834, in which the gradient or slope is 41.9 and the y- or energy-intercept is 834.
The slope can be interpreted as: “for every increase of one gram of fat there is an increase of 41.9 kilojoules of energy”. Here one gram is the unit of the independent variable (fat), 41.9 is the gradient or slope, and the kilojoule is the unit of the dependent variable (energy).
The y-intercept has the value 834. This can be interpreted as: “when the fat content of a food is zero, the energy provided by the food is 834 kilojoules”. This interpretation is reasonable because it is possible for food to have zero fat, and most foods will still have energy content from carbohydrates such as sugars.

INTERPOLATION AND EXTRAPOLATION
Interpolation means predicting values from a regression model (equation) for values within the range of the data on which the regression equation was based.
In the example above, the fat data ranged from 9 to 55. Using a value within this range to predict an energy value would be interpolation. If we predict that the energy content of a food containing 40 g of fat is 2509 kilojoules, this is interpolation.
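The regression equation and the 40 g prediction above can be reproduced without a calculator. The sketch below assumes the standard least squares formulas (slope a equal to the sum of products of deviations divided by the sum of squared x deviations, and intercept b equal to the mean of y minus a times the mean of x); the function names least_squares and predict_energy are illustrative only. It also labels each prediction as interpolation when the fat value lies inside the observed range of 9 g to 55 g, and as extrapolation otherwise.

```python
# Fat (g) and energy (kJ) for the 17 foods in the table above
fat = [15, 55, 18, 45, 17, 24, 30, 30, 16, 11, 9, 30, 24, 24, 30, 32, 30]
energy = [1255, 3555, 1800, 1880, 1670, 2520, 2300, 2300,
          1340, 1130, 1150, 2300, 1670, 1670, 2510, 1460, 2090]

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx            # slope
    b = mean_y - a * mean_x    # the line passes through (mean_x, mean_y)
    return a, b

a, b = least_squares(fat, energy)
print(f"y = {a:.1f}x + {b:.0f}")   # y = 41.9x + 834

def predict_energy(fat_g):
    """Predict energy (kJ) and say whether the prediction interpolates or extrapolates."""
    inside = min(fat) <= fat_g <= max(fat)
    kind = "interpolation" if inside else "extrapolation"
    return a * fat_g + b, kind

value, kind = predict_energy(40)
print(round(value), kind)          # 2509 interpolation
print(predict_energy(60)[1])       # extrapolation
```

Using the unrounded coefficients gives 2509 kJ for 40 g of fat, as quoted in the text; substituting into the rounded equation y = 41.9x + 834 gives 2510 kJ instead, so the small difference is only a rounding effect.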
Extrapolation means predicting values from a regression model (equation) for values outside the range of the data on which the regression equation was based.
Using a value for fat outside the range 9 g to 55 g would be extrapolation. Using the equation to predict the energy content of a food containing 60 g of fat or 5 g of fat would both be cases of extrapolation.
[Figure: line of best fit on dependent (0 to 4000) against independent (0 to 60) axes, with a lower pole and an upper pole; interpolation lies between the poles and extrapolation lies beyond them]
The accuracy of an interpolation depends on how linear the original data was. This can be gauged by determining the correlation coefficient and ensuring that the data is randomly scattered around the line of best fit.
The accuracy of an extrapolation depends not only on ‘how linear’ the original data was, but also on the assumption that the linear trend will continue past the poles. The validity of this assumption depends greatly on the situation under investigation.

Example 21
The table below shows the sales for Hancock’s Electronics, established in late 2000.
Year                 2001   2002   2003   2004   2005   2006
Sales ($ × 10 000)   5      9      14     18     21     27
a Draw a graph to illustrate this data.
b Find r².
c Find the equation of the line of best fit using the linear regression formula.
d Predict the sales figures for year 2008, giving your answer to the nearest $10 000. Comment on the reasonableness of this prediction.

a Let t be the time in years from 2000 and S be the sales in $10 000’s, i.e.,
t   1   2   3    4    5    6
S   5   9   14   18   21   27
[Figure: scatterplot of S (5 to 30) against t (1 to 6)]
b Using technology, r² = 0.9941.
c The line of best fit is S = 4.29t + 0.667.
d Using technology, in 2008, t = 8, so S ≈ 34.95, i.e., predicted year 2008 sales would be $350 000. The r and r² values suggest that the linear relationship between sales and year is very strong and positive. However, since this prediction is an extrapolation, it will only be reasonable if the trend evident from 2001 to 2006 continues to the year 2008, and this may or may not occur.

EXERCISE 8J
1 Recall the tread depth data of car tyres after travelling thousands of kilometres:
kilometres (x thousand)   14    17    24    34    35    37    38    39
tread depth (y mm)        5.7   6.5   4.0   3.0   1.9   2.7   1.9   2.3
a Which is the dependent variable?
b On a scatterplot graph the least squares regression line and state its equation.
c Use the equation of the line of best fit to estimate the tread depth of a new tyre.
d Brock claims that his tyres have done 50 000 km. Is this claim reasonable? Give evidence.
2 Tomatoes are sprayed with a pesticide-fertiliser mix. The figures below give the yield of tomatoes per bush for various spray concentrations.
Spray concentration (x, mL/L)       3    5    6     8     9     11    15
Yield of tomatoes per bush (y)      67   90   103   120   124   150   82
a Define the role of each variable and produce an appropriate scatterplot.
b Determine the value of r and r² and interpret.
c Is there an outlier present that is contributing to the low correlation?
d Remove the outlier from the data set and recalculate r and r². Is it reasonable to now draw a line of best fit?
e Determine the equation of the line of best fit.
f Give an interpretation for the slope and vertical intercept of this line.
g Use the equation of the least squares line to predict the yield if the spray concentration was 7 mL/L. Comment on the reasonableness of this prediction.
h If a 50 mL/L spray concentration was used, would this ensure a large tomato yield? Explain.
3 The table below shows the concentration of chemical X in the blood of an accident victim at various times after an injection was administered.
Time (minutes)                   10    20     30     40     50     60     70
Concentration (micrograms/mL)    105   38.0   13.1   4.75   1.42   0.63   0.12
a Sketch a scatterplot of the data.
b Calculate the r and r² values and interpret.
c Is the relationship between the variables strong enough to warrant drawing a least squares regression line?
4 A restaurateur believes that during March the number of people wanting dinner (y) is related to the temperature at noon (x °C). Over a period of a fortnight the number of diners and the noon temperature were recorded.
Temperature (x °C)     23  25  28  30  30  27  25  28  32  31  33  29  27  26
Number of diners (y)   57  64  62  75  69  58  61  78  80  67  84  73  76  67
a What is the independent variable?
b Generate a scatterplot of the data.
c Calculate r and r² and interpret.
d How accurate would an interpolation using the line of best fit be? Explain.
e Are there any obvious outliers that could be removed to improve the correlation?
5 It has long been thought that frosty conditions are necessary to ‘set’ the fruit of cherries and apples. The following data shows annual cherry yield and incidence of frosts for a cherry growing farm over a 7 year period.
Number of frosts (x)      27    23    7     37    32    14    16
Cherry yield (y tonnes)   5.6   4.8   3.1   7.2   6.1   3.7   3.8
a Draw a scatterplot for this data.
b Determine the r and r² values.
c Describe the association between cherry yield and the number of frosts.
d Determine the equation of the line of best fit.
e Give an interpretation for the slope and vertical intercept of this line.
f Use the equation of the least squares line to predict the cherry yield if 29 frosts were recorded.
Comment on the reasonableness of this prediction.
g Use the equation of the least squares line to predict the cherry yield if 1 frost was recorded. Comment on the reasonableness of this prediction.
6 The rate of a chemical reaction in a certain plant depends on the number of frost-free days experienced by the plant over a year which, in turn, depends on altitude. The higher the altitude, the greater the chance of frost. The following table shows the rate of the chemical reaction R, as a function of the number of frost-free days, n.
Frost-free days (n)     75     100    125    150    175    200
Rate of reaction (R)    44.6   42.1   39.4   57.0   34.1   31.2
a Produce a scatterplot for the data of R against n.
b Is it reasonable to draw a regression line? Give r² evidence.
Clearly, the data point (150, 57.0) is an outlier. Inspection of records reveals that it should be (150, 37.0).
c Change the outlier to its correct value and hence find the equation of the regression line which best fits the data. State the new value of r².
d Estimate the rate of the chemical reaction when the number of frost-free days is:
i 90
ii 215.
e Complete: “The higher the altitude, the ...... the rate of reaction.”
7 The following table gives peptic ulcer rates per 100 of population for differing family incomes in the year 2007.
Income (I thousand $)    10    15    20    25    30    40    50    60    80
Peptic ulcer rate R      8.3   7.7   6.9   7.3   5.9   4.7   3.6   2.6   1.2
a Obtain a scatterplot of the data.
b Find the line of best fit.
c What is the estimated peptic ulcer rate in families with $45 000 incomes?
d Explain why the model is inadequate for families with income in excess of $100 000.
8 The concentration of carbon dioxide (CO2) in the atmosphere at Port Adelaide has been recorded over a 40 year period.
CO2 concentration in the atmosphere has a large influence over our weather. CO2 concentration is measured in parts per million. Consider the table which follows:
Year                 1965   1970   1975   1980   1985   1990   1995   2000   2005
CO2 concentration    313    316    320    326    329    335    340    338    334
Let t be the number of years since 1965 and C be the CO2 concentration.
a Sketch a scatterplot of the data.
b Does a linear model appear to be appropriate? Explain.
The data for 2000 and 2005 is checked and found to be accurate, as CO2 levels have decreased due to environmental awareness. A researcher wishes to estimate the CO2 concentration for 1993.
c Delete the data for 2000 and 2005 and find the linear model that fits the 1965 to 1995 data. State the value of r².
d Use the model to predict the CO2 level in 1993.
e According to the model, what would the CO2 level have been in 2005 if a decrease in levels had not occurred?
f Is it reasonable to use the 2000 and 2005 data to predict the CO2 level in 2020?
9 Safety authorities advise drivers to travel 3 seconds behind the car in front of them as this provides the driver with a greater chance of avoiding a collision if the car in front has to brake quickly or is itself involved in an accident. A test was carried out to find out how long it would take a driver to bring a car to rest from the time a red light was flashed. (This is called stopping time, which includes reaction time and braking time.) The following results are for one driver in the same car under the same test conditions.
Speed (v km/h)           10     20     30     40     50     60     70     80     90
Stopping time (t secs)   1.23   1.54   1.88   2.20   2.52   2.83   3.15   3.45   3.83
a Produce a scatterplot of the data.
b Find the linear model which best fits the data.
c Is the linear model a good fit? Give evidence.
d Use the model to find the stopping time for a speed of:
i 55 km/h
ii 110 km/h
e What is the interpretation of the vertical intercept?
f Why does this simple rule apply at all speeds, with a good safety margin?