Descriptive Statistics
(Level IV Graduate Math)
Draft

(NSSAL)
C. David Pilmer
©2011
(Last Updated: Dec 2011)

This resource is the intellectual property of the Adult Education Division of the Nova Scotia Department of Labour and Advanced Education.

The following are permitted to use and reproduce this resource for classroom purposes.
• Nova Scotia instructors delivering the Nova Scotia Adult Learning Program
• Canadian public school teachers delivering public school curriculum
• Canadian nonprofit tuition-free adult basic education programs

The following are not permitted to use or reproduce this resource without the written authorization of the Adult Education Division of the Nova Scotia Department of Labour and Advanced Education.
• Upgrading programs at post-secondary institutions
• Core programs at post-secondary institutions
• Public or private schools outside of Canada
• Basic adult education programs outside of Canada

Individuals, not including teachers or instructors, are permitted to use this resource for their own learning. They are not permitted to make multiple copies of the resource for distribution. Nor are they permitted to use this resource under the direction of a teacher or instructor at a learning institution.

Acknowledgments

The Adult Education Division would also like to thank the following NSCC instructors for reviewing this resource and offering suggestions during its development.
Eileen Burchill (IT Campus)
Nancy Harvey (Akerley Campus)
Eric Tetford (Burridge Campus)
Tanya Tuttle-Comeau (Cumberland Campus)
Alice Veenema (Kingstec Campus)

Table of Contents

Introduction
Negotiated Completion Date
The Big Picture
Course Timelines

Populations and Samples
Tables
Types of Data
Bar Graphs and Histograms
Circle Graphs and Line Graphs
First Impressions
Second Impressions
What Type of Graph Should be Used
Mean, Median, Mode, and Trimmed Mean
Box and Whisker Plots
Using Technology to Make Box and Whisker Plots
Standard Deviation
Using Technology to Calculate Population Standard Deviation
Distributions
Normal Distributions and the 68-95-99.7 Rule
Z-Scores
Growth Charts
Putting It Together

Appendix
Area Under the Normal Curve (z-Table)
Weight-for-Age Percentiles: Boys
Length-for-Age Percentiles: Boys
Head Circumference-for-Age: Boys
Post-Unit Reflections
Answers

NSSAL ©2011 Draft C. D. Pilmer

Introduction

Statistics is the discipline concerned with the collection, organization, and analysis of data to draw conclusions or make predictions. Statistics is widely employed in government, business, and the natural and social sciences.

In this unit we will focus on descriptive statistics, the branch of statistics that deals with the description of data. In the first part of the unit, we will look at the different ways data can be presented using graphs (e.g. bar graphs, histograms, circle graphs, and line graphs) and how these graphs can be interpreted. In the next part of the unit we will learn how to determine and interpret measures of central tendency and standard deviation.

In descriptive statistics, we must differentiate between two important terms: population and sample.
A population is the set representing all measurements of interest to an investigator. A sample is a subset of measurements selected randomly from the population of interest. It is probably easier to look at these terms in the following way. Suppose you wanted to know the average income of working adults in your community. If you asked every working adult in the community, then you are dealing with the population. If, however, you randomly selected and interviewed only a portion of the working adults in your community, then you are dealing with a sample.

For the sake of simplicity, this unit will only focus on populations. For example, if one of the questions supplies student scores on a test, you will assume that these scores represent all the student scores, not a randomly selected portion of the scores.

The other branch of statistics that we have not discussed is inferential statistics. In inferential statistics, one makes inferences about population characteristics based on evidence drawn from samples. In other words, you take a random sample from a population and use the information collected from that small sample to make a prediction about the much larger population. For example, if you wanted to know how much time Nova Scotian adults between the ages of 20 and 40 spent watching television on weekdays, it would be impractical to collect data from every NS adult in that age group. It would be very challenging, time-consuming, and expensive. It would make more sense to randomly select 300 adults from that age group, collect the data, analyze the data, and use that data to predict the average number of hours all NS adults in that age group view television on weekdays.

Although inferential statistics is an extremely important branch of statistics, it goes beyond what is needed for a graduate level math course. Inferential statistics is, however, examined in the Academic Level IV Math course.
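
The sampling idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of weekday viewing times is simulated (invented numbers, not survey results), and the helper names `make_hypothetical_population` and `estimate_mean_from_sample` are made up for this sketch.

```python
import random
import statistics


def make_hypothetical_population(size, seed=1):
    """Build an invented list of weekday viewing times in hours (not real data)."""
    rng = random.Random(seed)
    # Assumed shape: times centred near 2.5 hours, kept between 0 and 8 hours.
    return [min(8.0, max(0.0, rng.gauss(2.5, 1.0))) for _ in range(size)]


def estimate_mean_from_sample(population, sample_size, seed=2):
    """Randomly select sample_size values and return their mean."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    return statistics.mean(sample)


population = make_hypothetical_population(50_000)

# Descriptive statistics: every value in the population is used.
population_mean = statistics.mean(population)

# Inferential statistics: only 300 randomly selected values are used.
sample_mean = estimate_mean_from_sample(population, 300)

print(f"Population mean: {population_mean:.2f} hours")
print(f"Mean estimated from a sample of 300: {sample_mean:.2f} hours")
```

With a random sample of this size, the two printed means are normally close to each other, which is why a sample can stand in for a population that is too large or too costly to measure in full.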

Negotiated Completion Date

After working for a few days on this unit, sit down with your instructor and negotiate a completion date for this unit.

Start Date: _________________

Completion Date: _________________

Instructor Signature: __________________________

Student Signature: __________________________

The Big Picture

The following flow chart shows the five required units and the four optional units (choose two of the four) in Level IV Graduate Math. These have been presented in a suggested order.

Math in the Real World Unit (Required)
• Fractions, decimals, percents, ratios, proportions, and signed numbers in real world applications
• Career Exploration and Math

Solving Equations Unit (Required)
• Solve and check equations of the form $Ax + B = Cx + D$, $A = Bx^2 + C$, and $A = Bx^3 + C$.

Consumer Finance Unit (Required)
• Simple Interest and Compound Interest
• TVM Solver (Loans and Investments)
• Credit and Credit Scores

Graphs and Functions Unit (Required)
• Understanding Graphs
• Linear Functions and Line of Best Fit

Measurement Unit (Required)
• Imperial and Metric Measures
• Precision and Accuracy
• Perimeter, Area and Volume

Choose two of the four.
• Linear Functions and Linear Systems Unit
• Trigonometry Unit
• Statistics Unit
• ALP Approved Projects (Complete 2 of the 5 projects.)

Note: You are not permitted to complete four ALP Approved Projects and thus avoid selecting from the Linear Functions and Linear Systems Unit, Trigonometry Unit, or Statistics Unit.

Course Timelines

Graduate Level IV Math is a two credit course within the Adult Learning Program. As a two credit course, learners are expected to complete 200 hours of course material. Since most ALP math classes meet for 6 hours each week, the course should be completed within 35 weeks. The curriculum developers have worked diligently to ensure that the course can be completed within this time span.
Below you will find a chart containing the unit names and suggested completion times. The hours listed are classroom hours.

Unit Name                      Minimum Completion    Maximum Completion
                               Time in Hours         Time in Hours
Math in the Real World Unit    24                    36
Solving Equations Unit         20                    28
Consumer Finance Unit          18                    24
Graphs and Functions Unit      28                    34
Measurement Unit               24                    30
Selected Unit #1               20                    24
Selected Unit #2               20                    24
Total                          154 hours             200 hours

As one can see, this course covers numerous topics and for this reason may seem daunting. You can complete this course in a timely manner if you manage your time wisely, remain focused, and seek assistance from your instructor when needed.

Populations and Samples

As we learned in the introduction, descriptive statistics is concerned with the description of data. This means that we look at methods that organize data and summarize data in an effective presentation that ultimately increases our understanding of the data.

In the same introduction, we learned about populations and samples. A population is the set representing all measurements of interest to an investigator. A sample is a subset of measurements selected randomly from the population of interest. The relationship between a sample and population can be represented by the diagram on the right where the sample is a small portion of the population. With the exception of this small section of the unit, we are only going to focus on populations.

[Figure: diagram showing the sample as a small region inside the population]

Example 1
The Testing and Evaluation Division of the Department of Education reported that the average mark on the grade 12 provincial math exam was 68%. This average was obtained by randomly selecting 500 exams from throughout the province. Are we dealing with a sample or a population? Explain.

Answer:
The Testing and Evaluation Division randomly selected 500 exams, rather than every exam. For this reason they were dealing with a sample (i.e. a subset of the population).

Example 2
Statistics Canada had all households complete the long-form census. They reported that the average salary, after tax, of unattached individuals in 2009 was $31 500. Are we dealing with a sample or a population? Explain.

Answer:
Since every household, which would include every unattached individual, was reporting, we are dealing with a population (i.e. all measurements of interest).

Questions:

1. The town's mayor is interested in knowing what portion of her 4127 taxpayers support the development of a new recreational center in the community. Because it is too costly to contact all the taxpayers, a survey of 300 randomly selected taxpayers is conducted. Describe the population and sample for this problem.

2. A building contractor just purchased 6000 used bricks. He knows that a small portion of these bricks are cracked and therefore unusable. He randomly selected 200 bricks and discovered that 14 of them were unusable. Describe the population and sample for this problem.

3. A company conducted a phone survey that involved 1200 randomly selected employed workers from Nova Scotia. Each participant had to report their annual gross income. At the time (2009) it was known that there were 453 000 employed workers in Nova Scotia. After conducting the survey and analyzing the data, the company reported an average annual income of $29 900 for the 1200 participants. Describe the population and sample for this problem.

4. Between 2001 and 2009, 3730 adults obtained high school diplomas through the Nova Scotia School for Adult Learning (NSSAL). The Nova Scotia government wanted to know how many of these adults pursued further education after obtaining their diploma. After interviewing 240 randomly selected graduates, it was discovered that 65% had pursued post-secondary education, primarily at the Nova Scotia Community College. Describe the population and sample for this problem.

Tables

Investigation: The Fringe Movie Festival

A small privately owned multiplex movie theatre has decided to host a fringe movie festival. Over the weekend, they are showing "cheesy" prequel movies that are obvious parodies of the original blockbusters. The following table shows the number of tickets sold for each movie over the weekend. They have broken the tickets into three categories: senior, adult, and child tickets.

Movie                                       Senior Tickets   Adult Tickets   Child Tickets
Jaws: The Teething Years                    158              349             54
Terminator: Rise of the Toasters            33               412             47
Star Wars: Episode 0                        133              341             146
Avatar: Evolving from the Blue Man Group    51               409             136
Transformers: The Horse and Buggy Years     62               350             122

Use the table to answer the following questions.

1. Which movie had the greatest number of child viewers?

2. Which movie had the greatest number of viewers during the festival? How did you arrive at this answer?

3. Which movie had the fewest viewers during the festival?

4. Based solely on ticket sales, what movie appeared to be most popular with both seniors and adults? How did you arrive at this answer?

5. Based solely on ticket sales, what movie appeared to be least popular with both seniors and adults?

6. Could you quickly answer the questions above? Besides a table, what other way could the data be displayed so that you can more efficiently address the questions?

7. Here is the stacked bar graph corresponding to the fringe movie festival ticket sales data.

[Figure: stacked bar graph of the number of tickets sold (0 to 700) for each of the five movies, with each bar divided into senior tickets, adult tickets, and child tickets]

What are your thoughts regarding presenting the data in this graphical form?

8. Was the fringe movie festival data presented in the table above derived from a sample or a population? Justify your answer.

Types of Data

In the last section we learned that data is often easier to understand if it is expressed as a graph instead of a table. Before we can look at all the different ways data can be displayed in graphical form (e.g. line graphs, circle graphs, and histograms), we need to take a few minutes and learn about the different types of data. These different types influence the type of graph that can be used.

When data is collected, the responses can be classified as a categorical data set or a numerical data set. These two terms are most easily explained using an example. Suppose we have an adult education class comprised of 10 learners who all have cell phones. The instructor asks two questions and obtains the following responses.

Question 1: What cell phone provider do you use?
Responses to Question 1: {Telus, Bell Aliant, Telus, Bell Aliant, Rogers, Rogers, Koodo, Rogers, Telus, Rogers}

Question 2: What was your cell phone bill for the previous month?
Responses to Question 2: {$27.80, $33.50, $45.70, $32.00, $54.90, $29.00, $43.65, $67.40, $35.89, $39.67}

The collection of responses to the first question is called a categorical data set. Categorical data is data that can be assigned to distinct non-overlapping categories. The responses to question 1 fit into four categories: Bell Aliant, Koodo, Rogers and Telus.

The collection of responses to the second question is called a numerical data set. This is the case because the data is comprised of numbers, specifically different amounts of money.

There are two types of numerical data: discrete and continuous.

Numerical data is discrete if the possible values are isolated points on a number line. For example, if survey participants were asked how many phone calls they made today, their responses would be whole numbers like 0, 4 or 12. They would not respond with something like 7.8 phone calls. Since they can only report isolated points, we end up with discrete numerical data.
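
As a brief aside before continuous data is defined, the two sets of cell phone responses above can be handled quite differently on a computer. The following is a minimal Python sketch, not part of the original unit: the responses are copied from the class example, while the variable names are assumptions made for this illustration. Categorical responses are tallied by category; numerical responses can be compared by size.

```python
from collections import Counter

# Responses to Question 1 (categorical data).
provider_responses = [
    "Telus", "Bell Aliant", "Telus", "Bell Aliant", "Rogers",
    "Rogers", "Koodo", "Rogers", "Telus", "Rogers",
]

# Responses to Question 2 (numerical data, in dollars).
bill_responses = [27.80, 33.50, 45.70, 32.00, 54.90,
                  29.00, 43.65, 67.40, 35.89, 39.67]

# Categorical data: count how many responses fall in each category.
provider_counts = Counter(provider_responses)
for provider in sorted(provider_counts):
    print(provider, provider_counts[provider])

# Numerical data: the values can be ordered, so a lowest and a highest exist.
lowest_bill = min(bill_responses)
highest_bill = max(bill_responses)
print(f"Bills range from ${lowest_bill:.2f} to ${highest_bill:.2f}")
```

The tally produces one count for each of the four categories, which is the kind of summary a bar graph displays. Asking for the "lowest" provider would not make sense, which is one practical way to tell categorical data from numerical data.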

Numerical data is continuous if the set of possible values forms an entire interval on the number line. For example, if soil samples were tested for acidity, the pH could be reported with numbers like 4, 4.17, 4.173, or any other number in the interval.

Generally continuous data arises when observations involve making measurements (e.g. weighing objects, recording temperatures, recording time to complete tasks) while discrete data arises when observations involve counting.

Question:

1. For each of the following, state whether the data collection would result in a categorical data set or numerical data set. If the data is numerical, indicate whether we are dealing with discrete or continuous data.

(a) Concentration in parts per million (ppm) of a particular contaminant in water supplies
(b) Brand of personal computer purchased by customers
(c) The sex of children born at the IWK Hospital in December
(d) The height of male adult education learners at a specific campus
(e) The number of children in each household
(f) The gross income of adult workers between the ages of 25 and 35 in Nova Scotia
(g) The races of people immigrating to Canada
(h) The time it takes for females between the ages of 20 and 30 to complete the 100 m dash
(i) The sum of the numbers rolled on two dice
(j) The amount of gas purchased by individual UltraCan customers on a specific day
(k) The size of shoe purchased by teenage males
(l) The destination city or town for summer vacations
(m) The head circumference of a newborn child
(n) The country of manufacture for vehicles in the staff parking lot at the NSCC Waterfront Campus

Bar Graphs and Histograms

Bar graphs and histograms look very similar so learners often get them confused.

Bar graphs are used to display categorical data or discrete numerical data. The bars in bar graphs are separated from one another. Examples of bar graphs are shown below.

Bar Graph #1
In this survey, 60 randomly selected Australian students were asked to report in which month they were born.

Bar Graph #2
In this survey, 200 randomly selected international students were asked which hand they write with.

Histograms are used to display continuous numerical data where the data is organized into classes. The bars on a histogram are not separated from one another.

Histogram #1
In this survey, 100 randomly selected students from all over the world were asked to report how long it took to travel from home to school. In this case the class width is 5. The first class goes from 0 to 5, not including 5. The second class goes from 5 to 10, not including 10.

Histogram #2
Forty randomly selected secondary students from Canada were asked to report their heights in centimeters. As with Histogram #1, the class width in this case is 5; however, the intervals do not start and end on multiples of 5. For example, the first class showing a value is centered at 120. That means that this class goes from 117.5 to 122.5, not including 122.5.

Bar graphs also come in different forms; two of the most common are stacked bar graphs and double bar graphs. We have already been exposed to stacked bar graphs when we completed the questions regarding the fringe movie festival in the section titled "Tables." On a stacked bar graph the bars are divided into categories so that we can compare the parts to the whole. In the case of the fringe movie festival graph, the bars were divided into three categories: senior tickets, adult tickets, and child tickets. By doing this we can quickly see how those three types of ticket sales contributed to the overall sales for each movie.

[Figure: stacked bar graph of the number of tickets sold (0 to 700) for each of the five movies, divided into senior tickets, adult tickets, and child tickets]

Double bar graphs allow one to present more than one kind of information, situation, or event in one graph, instead of drawing two separate bar graphs. One of the most common uses is to simultaneously display data for both males and females. The example on the right shows how the coffee purchasing decisions for males and females differ at a particular coffee shop on a particular morning.

[Figure: double bar graph of the quantity sold (0 to 45) of small, medium, and large coffees, with separate bars for males and females]

It should be mentioned that in all the bar graph examples we have provided to this point, the bars have been oriented vertically. Bar graphs can also be drawn such that the bars are in a horizontal orientation. That is what we have done with the stacked bar graph on the right which was obtained using the data from the fringe movie festival.

[Figure: horizontal stacked bar graph of the number of tickets sold (0 to 700) for each of the five movies, divided into senior tickets, adult tickets, and child tickets]

Example 1
Anne tracked the additional time, in minutes, she spent outside of regular class time to work on her five courses, over two days (Wednesday and Thursday). That information is displayed in the graph below.

[Figure: graph of minutes of additional work (0 to 40) on Wednesday and Thursday for Biology, Communications, Math, History, and Sociology]

(a) How much time did she spend on Thursday doing additional work in History?
(b) In what subject and on what day did she spend 25 minutes doing additional work?
(c) In what subject did she spend the same amount of time on Wednesday and Thursday doing additional work?
(d) How much more time did she spend on Wednesday doing additional work in Math compared to Thursday?
(e) How much more time did she spend on Thursday doing additional work in Biology compared to History?
(f) How much time over the two days did she spend doing additional work in Biology and Communications?

Answers:
(a) 10 minutes
(b) Math on Thursday
(c) Sociology (She spent 15 minutes each day.)
(d) Math Wednesday: 30 minutes
    Math Thursday: 25 minutes
    30 - 25 = 5 minutes
(e) Biology Thursday: 20 minutes
    History Thursday: 10 minutes
    20 - 10 = 10 minutes
(f) 15 + 20 + 20 + 35 = 90 minutes or 1.5 hours

Example 2
Thirty-six randomly selected males between the ages of 20 and 29 were weighed. The weights in pounds are shown below.

210 143 194 174 203 181 224 171 178 186 182 186
188 215 192 182 194 174 166 177 192 188 191 167
207 189 155 178 162 202 160 193 181 188 181 196

(a) Construct a histogram with class widths of 10 starting at 140.
(b) What percentage of the randomly selected males weighed less than 180 pounds?

Answers:
(a) Construct a table to organize the data in terms of the classes. The first class, 140 to 150, includes 140 but does not include 150.

Class         Tally    Frequency
140 to 150             1
150 to 160             1
160 to 170             4
170 to 180             6
180 to 190             11
190 to 200             7
200 to 210             3
210 to 220             2
220 to 230             1

Now construct the histogram.

(b) Out of the 36 participants, 12 weighed less than 180 pounds.
$\frac{12}{36} \times 100 = 33\tfrac{1}{3}\%$

Questions

1. A study was conducted to see which major league sport is most popular. In the study, they looked at how many fans (in millions) each sport has. The information is displayed using a bar graph.

Acronyms:
NFL: National Football League
NBA: National Basketball Association
MLB: Major League Baseball
NHL: National Hockey League
NASCAR: National Association for Stock Car Auto Racing

[Figure: graph of the number of fans in millions (0 to 200) for the NFL, NBA, MLB, NHL, and NASCAR]

(a) Which sport is most popular amongst the fans?
(b) Approximate the number of fans the National Hockey League has.
(c) Which major league sport has 120 million fans?
(d) Approximately how many more fans does the NFL have compared to the NBA?
(e) Is this a bar graph or histogram?

2. The medal counts for the 2006 and 2010 winter Olympics for four countries have been provided in the following graph.

[Figure: graph of the number of medals (0 to 40) won in 2006 and 2010 by Norway, Germany, the United States, and Canada]

(a) What type of graph are we dealing with?
(b) Of the four countries, which had the highest medal count in 2006?
(c) What was the medal count for the United States in 2010?
(d) Which country had a medal count of 19 in 2006?
(e) How many more medals did Canada obtain in 2010 compared to 2006?
(f) In 2010, how many more medals did the United States get compared to Germany?
(g) What was the total medal count for all four countries in 2010?
(h) What was the total medal count for both Germany and the United States over the 2006 and 2010 winter Olympics?

3. The Canadian Nurses Association reported the age distribution of all registered nurses (RNs) in Canada for the year 2009. This data was used to construct the following graph.

[Figure: graph of the number of RNs (0 to 40 000) in each age class: <24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, and 65+. Source: Canadian Institute for Health Information]

(a) What type of graph are we dealing with?
(b) What type of data was used to construct this graph?
(c) Approximately how many registered nurses in 2009 were between the ages 30 and 39?
(d) In 2009, approximately how many more 55 to 59 year old RNs were there compared to 60 to 64 year old RNs?
(e) What three classes of ages had the greatest number of RNs in 2009?
(f) Considering that Canada has an aging population, what potential problem is likely to occur in the near future based on the information supplied in this graph?

4. The Nephrology and Hypertension Department of the Children's Hospital in London, Ontario reported the number of cases they addressed over the different fiscal years (i.e. from April 1 of one year to March 31 of the next year). They broke the cases into three categories: new consults, consult visits, and inpatient days. New consults refer to cases that have been referred by an outside source (typically a family doctor) to the department. With each case, the information in the patient's medical file is reviewed to see if the patient's needs can be served by the department. Consult visits refer to day clinic visits by patients. Inpatient days refer to hospital stays by patients whose immediate needs cannot be met by day clinic visits.

[Figure: graph of the number of cases (0 to 1400) in each fiscal year from 2004/2005 to 2009/2010, broken into new consults, consult visits, and inpatient days. Source: University of Western Ontario, Department of Paediatrics]

(a) What type of graph are we dealing with?
(b) Were there significant changes in the number of new consults to the Nephrology and Hypertension Department over the six fiscal years?
(c) Approximately how many cases were dealt with in the 2008/2009 fiscal year?
(d) Approximately how many consult visits were dealt with in 2004/2005?
(e) Approximately how many cases involving inpatient visits were addressed in 2005/2006?
(f) Approximately how many more cases involving consult visits occurred in 2006/2007 compared to 2005/2006?
(g) What was the big shift from 2008/2009 to 2009/2010?

5. Thirty randomly selected families of four were asked how much they spent on their last family meal at a restaurant. The following data was obtained.

70 68 62 86 78 67 94 82 75 74
66 103 65 97 64 68 80 83 67 71
77 72 69 64 90 72 78 66 64 86

(a) Construct a histogram with class widths of 5 starting at 60.
Remember that the class 60 to 65 does not include the number 65. The 65 is in the next class.

Class         Tally    Frequency
60 to 65
65 to 70
70 to 75
75 to 80
80 to 85
85 to 90
90 to 95
95 to 100
100 to 105

(b) What percentage of the families spent $90 or more on their meal?
(c) What type of data are we dealing with?
(d) Are we dealing with a sample or population?

Circle Graphs and Line Graphs

Circle graphs, also called pie charts, are divided into sectors where each sector represents part of a whole. Each sector is proportional in size to the amount it represents. For example, if 70 out of 140 people responded that their favorite ice cream was chocolate, then the "chocolate" sector of the circle graph would be 50% or half of the circle graph.

Example 1
In 1999, registered nurses were asked to report where they were employed. The results are presented in the circle graph on the right. At the time there were 229 000 registered nurses in Canada.

[Figure: circle graph with the following sectors. Source: Registered Nurses Database]
Hospital                   59%
Other                      16%
Nursing Home               12%
Community Health Agency    8%
Home Care                  4%
Not Stated                 1%

(a) What percentage of registered nurses worked in nursing homes in 1999?
(b) Approximately how many registered nurses worked in hospitals in 1999?
(c) Approximately 9160 RNs were employed in what sector?
(d) Approximately how many RNs were employed in either home care or nursing homes?
(e) Approximately how many more RNs were employed in hospitals than in community health agencies?
(f) What is the ratio of RNs employed in community health agencies to RNs employed in nursing homes?

Answers:
(a) 12%
(b) 59% of 229 000
    0.59 × 229 000 = 135 110 RNs
(c) $\frac{9160}{229\,000} \times 100 = 4\%$
    These RNs are working in home care.
(d) 4% + 12% = 16%
    16% of 229 000
    0.16 × 229 000 = 36 640 RNs
(e) 59% - 8% = 51%
    51% of 229 000
    0.51 × 229 000 = 116 790 RNs
(f) $\frac{\text{community health agency}}{\text{nursing home}} = \frac{8}{12} = \frac{8 \div 4}{12 \div 4} = \frac{2}{3}$ ← desired ratio

Line graphs are created by plotting data points and connecting them with lines. These lines are useful for showing trends; that is, how something changes in value as something else happens.

Example 2
This line graph shows how the fertility rate in Canada has changed since 1950. The fertility rate is the average number of children born to women between the ages of 15 and 49.

[Figure: line graph of the fertility rate (0 to 4.5) by year, 1950 to 2000. Source: Statistics Canada]

(a) What was the approximate fertility rate in 1970?
(b) In what year was the fertility rate approximately 3.2?
(c) How much did the fertility rate drop by between 1960 and 1970?
(d) After 1960, when did the fertility rate increase?

Answers:
(a) 2.3
(b) 1965
(c) 3.9 - 2.3 = 1.6
    The fertility rate dropped by approximately 1.6.
(d) It only increased slightly between 1985 and 1990.

Questions

1. The following circle graph was constructed using data collected from all patients over a one month period at a specific emergency room. That month 1200 patients visited the site.

[Figure: circle graph with the following sectors]
auto accidents          27%
work injuries           24%
home injuries           19%
heart attacks           14%
respiratory problems    9%
miscellaneous           7%

(a) What is the leading cause of emergency room visits to this location during this month?
(b) How many times more likely was the staff at this emergency room going to see a patient injured in an auto accident compared to a patient having respiratory problems?
(c) How many patients suffering from work related injuries sought treatment at the emergency room?
(d) How many more patients sought treatment for heart attacks compared to patients suffering from respiratory problems?
(e) Which one of the following represents the ratio of patients with work related injuries to patients suffering from heart attacks? (Multiple Choice)
    (i) $\frac{7}{12}$    (ii) $\frac{12}{7}$    (iii) $\frac{14}{27}$    (iv) $\frac{27}{14}$
(f) What was the cause of emergency room visits for 228 patients?

2. The following graph shows the value of Canada's exports from January 2008 until November 2010. The values are expressed in millions of Canadian dollars; for example, the number 20,000 on the vertical scale represents $20,000 million or $20 billion.

[Figure: line graph of Canada's monthly exports in millions of dollars (0 to 50,000), January 2008 to November 2010. Source: Statistics Canada]

(a) Name at least three periods when Canada's exports largely remained unchanged.
(b) During what month and year did Canada's exports almost reach $45 billion?
(c) When were Canada's exports lowest between Jan-08 and Nov-10?
(d) Approximately how much did exports drop by between October 2008 and January 2009? Based on your knowledge of world events, why do you think this occurred?

3. There were 725 housing starts in the first quarter of 2011 in Nova Scotia. These starts were broken into four categories: single detached (i.e. single dwelling homes), semi-detached (i.e. single-family home that is joined on one side to another home), row housing (i.e. townhouse), and apartments.

[Figure: circle graph with the following sectors. Source: Canada Mortgage and Housing Corporation]
Apartments         337
Single Detached    293
Semi-detached      60
Row Housing        35

(a) What percentage of the housing starts was for single detached homes?
(b) What is the ratio of row housing starts to semi-detached starts?
(c) How many more apartment starts were there compared to the combined row housing and semi-detached starts?
(d) The Canada Mortgage and Housing Corporation predicts that the second quarter housing starts in Nova Scotia will increase from 725 to 850. If they assume that the proportion of single detached starts remains the same from the first quarter to the second, how many single detached starts do they anticipate in this second quarter?

4. The value of a stock changes over time. The following line graph shows how the value of Research in Motion (RIM) stock changed over the month of June 2011. Notice that the horizontal axis shows 22 days, rather than 30. There were only 22 trading days in June 2011; stocks are not traded on weekends.

[Line graph: Value of RIM Stock ($) (vertical axis, 20 to 42) versus Trading Day (horizontal axis, 0 to 22). Source: Nasdaq.com]

(a) On what trading day was the greatest single day loss in the value of RIM shares during the month of June? Approximate the amount that was lost per share on that day.
(b) By approximately how much did the stock drop from the beginning of the month until the end of the month?
(c) On what trading day was the greatest single day gain in the value of RIM shares during the month of June? Approximate the amount that each share increased by on that day.

First Impressions

Part 1
Grocery store customers were asked to identify their favorite brand of ice cream. Once the data was collected, a circle graph was constructed. It is shown below.

[Circle graph with three sectors: Jen and Berry Ice Cream, Charmer Dairies Ice Cream, Faxter Ice Cream]

What is your first impression regarding customers' preferences for particular brands of ice cream?

Part 2
The 2001 population counts for five urban centres in Canada were used to construct this graph.
[Bar graph: Population (vertical axis, 50000 to 140000) for Lethbridge, AB; Moncton, NB; Nanaimo, BC; Sarnia, ON; Trois-Rivières, QC. Source: Statistics Canada]

What is your first impression regarding the population counts for these centres?

Part 3
The owners of an amusement park kept track of the number of male and female patrons that used four particular rides in the park on a weekday morning. They used the data to construct the following graph.

[Stacked bar graph: Percentage (vertical axis, 0% to 100%) of Females and Males for four rides: Hurl-a-Twirl, Bumper Boats, Zip Line, Death Drop]

What is your first impression regarding the patron usage of these rides?

Part 4
The following line graph shows how the average price of a domestic flight from Halifax changed from the first quarter of 2007 to the third quarter of 2010.

[Line graph: Average Domestic Fare (vertical axis, 160 to 210) versus Quarters (horizontal axis, I-2007 to III-2010). Source: Statistics Canada]

What is your first impression regarding the change in the price of a domestic flight?

Second Impressions

We are going to re-examine some of the real world applications that we were exposed to in the section titled "First Impressions."

In part 1 of First Impressions, we looked at a circle graph regarding customers' preferences for particular brands of ice cream. We have redrawn the circle graph using the same data. Based on this new perspective of the circle graph, have your first impressions changed? Why or why not?

[Circle graph: Charmer Dairies Ice Cream 36%, Jen and Berry Ice Cream 36%, Faxter Ice Cream 28%]

In part 2 of First Impressions, we looked at a bar graph regarding 2001 population counts for five Canadian urban centres. We have redrawn the graph using the same data. Based on this new graph, has your first impression changed? Why or why not?
[Bar graph: Population (vertical axis, 0 to 140000) for Lethbridge, AB; Moncton, NB; Nanaimo, BC; Sarnia, ON; Trois-Rivières, QC]

In part 3 of First Impressions, we looked at a stacked bar graph regarding the patron usage of four specific rides in an amusement park. We have redrawn the graph using the same data. Based on this new graph, has your first impression changed? Why or why not?

[Stacked bar graph: Number of People (vertical axis, 0 to 250), Females and Males, for Hurl-a-Twirl, Bumper Boats, Zip Line, and Death Drop]

In part 4 of First Impressions, we looked at a line graph regarding the average price of domestic flights from Halifax. We have redrawn the graph using the same data. Based on this new graph, has your first impression changed? Why or why not?

[Line graph: Average Domestic Fare (vertical axis, 0 to 250) versus Quarters (horizontal axis, I-2007 to III-2010)]

Why did we bother exposing you to the two versions of each of these graphs?

What Type of Graph Should Be Used?

Below you have been provided with data tables. Indicate what type of graph (histogram, line, circle, bar, double bar, or stacked bar graph) you would use for this data. In a few cases, there can be more than one acceptable answer.

1. Graph Type: _______________________

Favorite Music Genre: Pop, Rock, Hip Hop, Country, Blues, Other
Male, Female: 90 120 70 100 50 70 150 70 60 120 40 60

2. Graph Type: _______________________

Favorite Movie Genre    Percentage
Action                  32
Comedy                  18
Drama                   15
Horror                  8
Science Fiction         21
Other                   6

3. Graph Type: _______________________

time (s)    distance (m)
0           0
2           1.6
4           3.2
6           4.8
8           6.4

4. Graph Type: _______________________

Time Commuting to Work (min)    Frequency
0 - 10                          27
10 - 20                         39
20 - 30                         58
30 - 40                         43
> 40                            12

5. Graph Type: _______________________

Triathlon Athlete    Swim Time (min)    Bike Time (min)    Run Time (min)
Anne                 10                 55                 35
Jane                 12                 54                 37
Denise               13                 58                 40
Meera                11                 53                 39
Yoshi                10                 53                 41

6. Graph Type: _______________________

Blood Type    Percentage
A+            35.7
A-            6.3
O+            37.4
O-            6.6
B+            8.5
B-            1.5
AB+           3.4
AB-           0.6

7. Graph Type: _______________________

Town               Population in 2006
Amherst            9505
Digby              2092
Kentville          5812
Pictou             3813
Port Hawkesbury    3517

8. Graph Type: _______________________

Television Program Type: Comedy, Drama, Reality
Audience Share (%), 1996 - 1997 and 2001 - 2002: 12 13 10 8 9 8

9. Graph Type: _______________________

Salaries in Thousands of Dollars    Number of Employees
15 - 25                             16
25 - 35                             43
35 - 45                             57
45 - 55                             48
55 - 65                             23
65 - 75                             11
more than 75                        6

10. Graph Type: _______________________

Year    Cell Phone Revenues (Billions of Canadian Dollars)
1997    3.3
1998    4.4
1999    4.6
2000    5.4
2001    6.0
2002    7.2
2003    8.1

Mean, Median, Mode, and Trimmed Mean

Charlie looks at the marks his Level IV Graduate Math learners earned in a particular unit over the last year.

{81, 74, 91, 82, 79, 95, 78, 92, 86, 74, 78, 69, 84, 77, 88, 78, 71}

He wants to report how well his students performed on this particular unit without having to supply all seventeen pieces of data. He could use a histogram to display the results but he decides instead to calculate two measures of central tendency: the mean (arithmetic average) and median (middle).

Mean

The most common measure of central tendency is the arithmetic average, or mean. When calculating a mean, statisticians differentiate between population means and sample means by using different symbols. The procedure for calculating either of these means is identical. The population mean and sample mean are calculated by adding all the data points and then dividing by the number of data points.

\[ \mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} \]
where µ (mu) is the population mean

\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} \]
where \( \bar{x} \) (x bar) is the sample mean

Although later sections of this unit concentrate only on populations, in this section we will ask you to know both formulas, specifically the two symbols (µ and \( \bar{x} \)) used to represent the different means.

Let's return to Charlie's math marks. Since he is looking at the marks of all of the learners who completed the unit, he is dealing with a population. The population mean, µ, is calculated below.

\[ \mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} \]
\[ \mu = \frac{81 + 74 + 91 + 82 + 79 + 95 + 78 + 92 + 86 + 74 + 78 + 69 + 84 + 77 + 88 + 78 + 71}{17} \]
\[ \mu = \frac{1377}{17} \]
\[ \mu = 81 \]

The mean mark for Charlie's learners on this unit is 81%.

Median

The mean is not the only way to describe the center. Another method is to use the "middle value" of the data, which is called the median. The median separates the higher half of the data from the lower half. The median can be calculated in the following manner.
1. Arrange the data points in order of size, from smallest to largest.
2. If the number of data points is odd, then the median is the data point in the middle of the ordered list.
3. If the number of data points is even, then the median is the mean of the two data points that share the middle of the ordered list.

Return to Charlie's math marks. The median is calculated by following the procedure provided below.

Order the data points from smallest to largest.
69, 71, 74, 74, 77, 78, 78, 78, 79, 81, 82, 84, 86, 88, 91, 92, 95

Since we have an odd number of data points (n = 17), the median will be the middle (ninth) data point of the ordered list.
69, 71, 74, 74, 77, 78, 78, 78, 79, 81, 82, 84, 86, 88, 91, 92, 95

The median will be 79.

Suppose we had another instructor, Angela, who had sixteen learners who completed the same unit. She has recorded the marks that they earned and worked out the mean and median.

{99, 94, 80, 63, 77, 99, 68, 62, 95, 78, 66, 93, 65, 64, 98, 95}

Mean:
\[ \mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} \]
\[ \mu = \frac{99 + 94 + 80 + 63 + 77 + 99 + 68 + 62 + 95 + 78 + 66 + 93 + 65 + 64 + 98 + 95}{16} \]
\[ \mu = \frac{1296}{16} \]
\[ \mu = 81 \]

The mean mark for these learners on this unit is 81%.

Median:
Order the data points from smallest to largest.
62, 63, 64, 65, 66, 68, 77, 78, 80, 93, 94, 95, 95, 98, 99, 99

Since the number of data points is even (n = 16), the median is the mean of the two data points that share the middle of the ordered list (the eighth and ninth points, 78 and 80).
62, 63, 64, 65, 66, 68, 77, 78, 80, 93, 94, 95, 95, 98, 99, 99

\[ \text{Median} = \frac{78 + 80}{2} = 79 \]

Are the Mean and Median Enough?

These measures of central tendency often do not give us a complete understanding of the data set because they do not give any indication of how the data is spread out. This is especially evident when we look at the means and medians for the two groups of math students previously discussed. Although the means and medians are identical for Charlie's and Angela's learners, the marks earned by the two groups are vastly different.
• In Charlie's group, the majority of students earned marks between 71 and 88. There was only one mark in the sixties and only three marks in the nineties. The marks are clustered together.
• In Angela's group, learners could largely be divided into two groups: learners who did very well (i.e. obtained marks in the 90's) and learners who found the material challenging (i.e. obtained marks in the 60's). The marks are not clustered together as they were with Charlie's learners.

Range of Marks    Number of Charlie's Learners    Number of Angela's Learners
60 to 65          0                               3
65 to 70          1                               3
70 to 75          3                               0
75 to 80          5                               2
80 to 85          3                               1
85 to 90          2                               0
90 to 95          2                               2
95 to 100         1                               5

It is important to note that our two measures of central tendency, mean and median, did not reveal this important difference between the two data sets. We will address this issue in a later section of this unit.

When are the Mean and Median Not Close to Each Other?
There are times when the mean and median may not be close to each other. One case is if an outlier exists within the data set. An outlier is a data point that falls outside the overall pattern of the data set. Consider the following data set where the data points have already been arranged in ascending order.

{2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 3.4, 3.5, 3.5, 3.6, 3.7, 3.9, 4.0, 4.2, 16.7}

Notice that all but one data point is between 2.8 and 4.2. The mean for this data set is 4.3 and the median is 3.5. In this case the median is a far better measure of central tendency than the mean. The outlier, 16.7, influenced the mean to the point where it no longer accurately represented the center of the data set.

The extreme sensitivity of the mean to even a single outlier and the insensitivity of the median to outliers led to the development of trimmed means. Trimmed means are calculated by ordering the data points from smallest to largest, deleting a selected number of points from both ends of the ordered list, and finally averaging the remaining numbers. For example, to calculate the 5% trimmed mean, the bottom 5% of the data points and the top 5% of the data points are deleted.

Consider the data set given above. We will calculate the 5% trimmed mean for this data set. Since 5% of the number of data points (i.e. 5% of 15) is 0.75, we round up to 1 (rounding to the nearest whole number). Since we obtained a 1, we drop one data point from the bottom and one data point from the top of the data set; that is, we drop 2.8 and 16.7.

2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 3.4, 3.5, 3.5, 3.6, 3.7, 3.9, 4.0, 4.2, 16.7

Finally we work out the mean of the remaining thirteen data points.

\[ \text{5\% trimmed mean} = \frac{3.0 + 3.0 + 3.1 + 3.2 + 3.4 + 3.4 + 3.5 + 3.5 + 3.6 + 3.7 + 3.9 + 4.0 + 4.2}{13} = 3.5 \]

Notice that this trimmed mean is equal to the median that we previously calculated.
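For learners who want to check this kind of calculation with software, the procedure just described can be sketched in Python. This is only a minimal sketch: the function name trimmed_mean and its parameters are our own illustrative choices, not part of this unit, and we assume the "round to the nearest whole number" rule used in the worked example, with halves rounded up.

```python
import math

def trimmed_mean(data, percent):
    """Mean after dropping percent% of the points from each end of the ordered data.

    Hypothetical helper for illustration; the rounding rule is an assumption
    taken from the worked example (nearest whole number, halves rounded up).
    """
    ordered = sorted(data)
    # Number of points to drop from each end of the ordered list.
    k = math.floor(len(ordered) * percent / 100 + 0.5)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

values = [2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 3.4, 3.5, 3.5,
          3.6, 3.7, 3.9, 4.0, 4.2, 16.7]

mean = sum(values) / len(values)
# The list has an odd number of points, so the median is the middle one.
median = sorted(values)[len(values) // 2]
trimmed = trimmed_mean(values, 5)

print(round(mean, 1), median, round(trimmed, 1))  # prints: 4.3 3.5 3.5
```

Running the sketch on the data set above reproduces the three values from the text: a mean of 4.3, a median of 3.5, and a 5% trimmed mean of 3.5.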
By eliminating the effects of outliers, the resulting trimmed mean and the median should be close to each other.

The symbol, x(T), is used to represent a trimmed mean. The only problem with this symbol is that it does not indicate whether we are dealing with a 5%, 10%, 15% or 20% trimmed mean.

Example 1
Twenty-two runners of the 100 m dash were randomly selected from colleges and universities in Canada. The time of each runner in the last competition was recorded. Of these runners, one person had pulled a hamstring and another had tripped during their last competition. The times in seconds are recorded below. Determine the mean, median, and 10% trimmed mean.

10.23  10.89  11.76   9.87  11.33  10.75   9.96  11.54  10.52  18.57   9.72
12.05  11.56  10.15  19.42  11.68  12.09  11.49  11.67  10.19  10.52   9.99

Answer:

\[ \text{Mean} = \frac{10.23 + 10.89 + 11.76 + \dots + 10.19 + 10.52 + 9.99}{22} = 11.63 \]

Median:
Rearrange the data points from smallest to largest. Since we are dealing with an even number of data points (22), the median is the mean of the two data points that share the middle of the ordered list (10.89 and 11.33).

9.72, 9.87, 9.96, 9.99,…, 10.75, 10.89, 11.33, 11.49,…, 12.05, 12.09, 18.57, 19.42

\[ \text{Median} = \frac{10.89 + 11.33}{2} = 11.11 \]

10% Trimmed Mean:
Since 10% of the number of data points (i.e. 10% of 22) is 2.2, we round down to 2 (rounding to the nearest whole number). We will now drop two data points from the bottom (9.72 and 9.87) and two data points from the top (18.57 and 19.42) of the data set, and then work out the mean of the remaining eighteen data points.

9.72, 9.87, 9.96, 9.99, 10.15,…, 11.76, 12.05, 12.09, 18.57, 19.42

\[ \text{10\% trimmed mean} = \frac{9.96 + 9.99 + 10.15 + \dots + 11.76 + 12.05 + 12.09}{18} = 11.02 \]

Mode

The mode of a set of data is the value in the set that occurs most frequently. For the following data, the mode is 6 because it occurs more times than any other value.

{2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 7, 7, 9, 10}
Mode = 6

Many textbooks and websites refer to the mode as a measure of central tendency; this is incorrect.
Although the mode is often around the center of the data set when the points are arranged from smallest to largest, this is not always the case. Consider the data we previously examined concerning Charlie's and Angela's Graduate Math learners.

Data for Charlie's Learners
Order the data points from smallest to largest, and identify the data point that occurs most frequently.
69, 71, 74, 74, 77, 78, 78, 78, 79, 81, 82, 84, 86, 88, 91, 92, 95
Mode = 78

Data for Angela's Learners
Order the data points from smallest to largest, and identify the data point(s) that occur most frequently.
62, 63, 64, 65, 66, 68, 77, 78, 80, 93, 94, 95, 95, 98, 99, 99
The data points 95 and 99 occur the most frequently; therefore we state that this data set is bimodal.
Mode = 95 and 99

The mode for Charlie's data is close to the center of the data set; however, the modes for Angela's data are not near the center.

Questions

Please use the appropriate symbols ( \( \bar{x} \), µ, and x(T) ) when answering these questions.

1. A study regarding the size of winter wolf packs in regions of the United States, Canada, and Finland was conducted. The following data from 18 randomly selected packs was obtained.

2 3 15 8 7 8 2 4 13 7 3 7 10 7 5 4 2 4

(a) Are we dealing with a sample or a population? _____________________
(b) Determine the mean, median, and mode.
(c) Why would the researchers not likely use a trimmed mean with this data set?

2. A local cab company has a fleet of nine cars. The company kept records of the amount of money each vehicle required over a one-week period. The data is shown below.

$125 $157 $210 $139 $182 $167 $143 $150 $162

(a) Are we dealing with a sample or a population? _____________________
(b) Are we dealing with a numerical or categorical data set? _____________________
(c) Determine the mean, median, and mode.

3.
A magazine conducted a survey where they wished to understand the average class size of first year courses at a local community college. They randomly selected 17 first year classes and obtained the following numbers.

23 37 36 40 39 115 28 25 23 32 27 16 15 31 27 34 41

(a) Are we dealing with a sample or a population? ____________________
(b) Determine the mean, median, mode, 5% trimmed mean, and 10% trimmed mean.
(c) Why is it appropriate to use trimmed means in this situation?
(d) If this data set was comprised of 78 data points and we wanted to calculate a 5% trimmed mean, how many data points would be dropped from the bottom and top of the data set?

4. A new subdivision outside of Halifax was constructed over the last few years. Barb wanted to know what the average value of the new homes was. She was not prepared to look at the assessed values of all 218 new homes. Instead she randomly selected 24 homes and recorded their assessed values. These values in thousands of dollars are shown below.

267 265 226 254 231 221 246 252 253 241 261 589 243 269 267 253 287 320 221 264 257 249 226 267

(a) Calculate the mean, median, mode, and 5% trimmed mean.
(b) Which of these measures is not influenced or less influenced by extremely high or low data points?
(c) Would a histogram or a bar graph be used with this data set?

5. In gymnastics and diving, several judges score each athlete. The final score for the athlete is calculated by removing the high and low scores and averaging the remainder. Why do you think they use this trimmed mean scoring method in gymnastics and diving?

Box and Whisker Plots

Box and whisker plots, also called box plots, are a quick graphic approach for examining one or more sets of data. The plot gets its name because the middle portion is a rectangular box, and a line (whisker) typically extends from each end of the box.
[Box and whisker plot on a number line from 15 to 40, with the box and the two whiskers labelled]

The box and whisker plot provides us with five critical pieces of information regarding the data that was used to construct it. (Refer to the diagram below.)
• We are supplied with the minimum value in our data set. In this case, that value is 17.
• We are supplied with the maximum value in our data set. In this case, that value is 36.
• We are supplied with the median (or middle) of the data set. In this case, the median is 26.
• We are supplied with the lower quartile (also called first quartile or Q1). This value is found by working out the median of the numbers below the median of the entire set of data. The lower quartile is the number that 25% of the data is below. In this case, the lower quartile is 21.
• We are supplied with the upper quartile (also called third quartile or Q3). This value is found by working out the median of the numbers above the median of the entire set of data. The upper quartile is the number that 25% of the data is above. In this case, the upper quartile is 30.

[Box and whisker plot on a number line from 15 to 40, labelled from left to right: minimum value, lower quartile, median, upper quartile, maximum value]

Before we learn how to construct a box and whisker plot, we are going to look at a sample question involving a real world context where we have to compare two plots.

Example 1
Two blood testing departments at different Nova Scotia hospitals recorded their patient wait times in minutes. This data was used to construct the two box and whisker plots.

[Two box and whisker plots, Department A and Department B, on a number line from 0 to 30]

How do the wait times compare at these two blood testing departments?

Answer:
Although the minimum value for Department B is 2 minutes less than the minimum value for Department A, and the lower quartile for Department B is 1 minute less than the lower quartile of Department A, the overall results for Department A are better.
The median for Department A is slightly better, and the upper quartile and maximum value for Department A are much better than those for Department B. Department A appears to deliver a more consistent level of service in terms of wait times; that is why the box and whiskers are shorter for Department A's plot. We can say that the wait times are clustered closer together for Department A versus Department B. To explain this further, just look at the boxes for the two plots. Based on the first box, we can see that the middle 50% of Department A's patients are served between 10 minutes and 16 minutes. Based on the second box, the middle 50% of Department B's patients are, however, served between 9 minutes and 21 minutes; a much longer time span. We can also conclude that generally patients had shorter wait times at Department A.

Making a Box and Whisker Plot

It is a six step process to construct a box and whisker plot.
(i) Arrange the data points in order of size, from smallest to largest.
(ii) Identify the minimum value and maximum value.
(iii) Determine the median.
(iv) Find the lower quartile by finding the median of the numbers below, but not including, the median of the entire set of numbers.
(v) Find the upper quartile by finding the median of the numbers above, but not including, the median of the entire set of numbers.
(vi) Draw your box and whisker plot along a number line using the values you found in steps (ii) through (v).

Example 2
Construct a box and whisker plot for the following data.
22, 4, 11, 24, 18, 9, 19, 21, 13

Answer:
(i) Arrange from smallest to largest.
    4, 9, 11, 13, 18, 19, 21, 22, 24
(ii) Minimum Value = 4, Maximum Value = 24
(iii) Find the median (i.e. middle value).
    4, 9, 11, 13, 18, 19, 21, 22, 24
    Median = 18
(iv) Find the lower quartile. This is done by taking the lower 50% of the data, not including the median from step (iii), and finding the median of these data points.
    4, 9, 11, 13
    \[ \text{Lower Quartile} = \frac{9 + 11}{2} = 10 \]
(v) Find the upper quartile. This is done by taking the upper 50% of the data, not including the median from step (iii), and finding the median of these data points.
    19, 21, 22, 24
    \[ \text{Upper Quartile} = \frac{21 + 22}{2} = 21.5 \]
(vi) Draw the plot along a number line.

[Box and whisker plot on a number line from 5 to 25]

Example 3
Display the following as a box and whisker plot.
10, 14, 21, 26, 16, 12, 14, 9, 17, 26

Answers:
(i) Arrange from smallest to largest.
    9, 10, 12, 14, 14, 16, 17, 21, 26, 26
(ii) Minimum Value = 9, Maximum Value = 26
(iii) Find the median.
    9, 10, 12, 14, 14, 16, 17, 21, 26, 26
    \[ \text{Median} = \frac{14 + 16}{2} = 15 \]
(iv) Find the lower quartile using the lower 50% of the data, not including the median.
    9, 10, 12, 14, 14
    Lower Quartile = 12
(v) Find the upper quartile using the upper 50% of the data, not including the median.
    16, 17, 21, 26, 26
    Upper Quartile = 21
(vi) Draw the plot along a number line.

[Box and whisker plot on a number line from 5 to 25]

Questions

1. Construct a box and whisker plot for each of the following sets of data.

(a) 30, 15, 6, 24, 19, 15, 17, 21, 20, 11, 9
    Remember to start by reorganizing the data.

    [Number line from 5 to 30]

(b) 45, 46, 37, 52, 33, 34, 43, 43, 48, 50, 49, 43, 46, 40
    Remember to start by reorganizing the data.

    [Number line from 25 to 55]

(c) 31, 26, 38, 25, 24, 29, 31, 37, 38, 30, 40, 27, 24, 24, 31, 26, 33

    [Number line from 20 to 45]

(d) 38, 37, 40, 28, 34, 36, 35, 41, 38, 35

    [Number line from 20 to 45]

2. A reaction time experiment is conducted in several adult education classrooms. In the experiment one student releases a ruler and a second student tries to grasp it as quickly as possible. The distance that the ruler drops is one way to measure the second student's reaction time. For example, if Student A's ruler only drops 7 cm compared to Student B's ruler that drops 12 cm, then we could say that Student A has a better reaction time.

(a) Each member of Mrs.
Leck's math class participated in the experiment. The following data was collected. Construct a box-and-whisker plot.

18 22 10 19 12 21 7 16 22 20 9 20 11

[Number line from 5 to 30]

(b) Mr. Porter's class and Mr. Churchill's class participated in the same experiment. A box-and-whisker plot was constructed for both classes.

[Two box-and-whisker plots, Mr. Porter's Class and Mr. Churchill's Class, on a number line from 5 to 30]

How do the two classes compare in terms of reaction times?

(c) Mrs. Lowe's class and Mr. Vroom's class participated in the same experiment. The following data was collected.

Mrs. Lowe's Class: 9 17 6 12 15 20 10 17 13 19 20 10
Mr. Vroom's Class: 16 20 23 10 23 18 6 21 17 23 15

Construct two box-and-whisker plots.

[Number line from 5 to 30]

How do the two classes compare in terms of reaction times?

(d) Mrs. Burchill's class and Mr. Rhodenizer's class participated in the same experiment. The following data was collected.

Mrs. Burchill's Class: 16 7 12 5 21 13 16 10 18 11 8 19 14 11
Mr. Rhodenizer's Class: 9 14 13 19 8 16 11 22 14 6 11

Construct two box-and-whisker plots.

[Number line from 5 to 30]

How do the two classes compare in terms of reaction times?

Using Technology to Make Box and Whisker Plots

The TI-83 and TI-84 graphing calculators can draw box and whisker plots. This is particularly useful when we have lots of data. In this example we are going to use two sets of data to create two box and whisker plots at the same time.

First Data Set
5.8 3.9 11.0 4.5 7.2 6.0 9.3 6.2 5.3 4.7 4.5 14.5 10.2 3.2 6.1 8.0 16.1 7.1 12.7 6.9 5.2 15.9 7.8 13.2 4.7 6.7

Second Data Set
7.3 10.2 8.3 13.2 7.2 12.6 9.9 7.7 5.0 9.0 9.4 6.9 9.7 8.7 7.5 4.9 8.3 8.2 8.1 8.5 7.9 10.0 8.6 7.7 4.8 7.2 3.1 4.9

Procedure:
1. Enter the First Data Set in List 1 and the Second Data Set in List 2
   STAT > EDIT > Edit > Enter first data set in L1 > Enter second data set in L2
2.
Turn on the Plots
   STAT PLOT > Select Plot 1 > Select On, Box and Whisker, and L1 > STAT PLOT > Select Plot 2 > Select On, Box and Whisker, and L2
3. Draw the Box and Whisker Plots
   ZOOM > ZoomStat > TRACE > Use the right, left, up and down buttons to see the different values on the box and whisker plots.

Questions

In the following questions you will be asked to draw histograms as well as box-and-whisker plots. You are required to draw the histograms by hand and the box and whisker plots using technology.

1. Mrs. Ross is coaching her daughter's junior high basketball team. She has three players to choose from the bench. The statistics for each of the players are shown below. You are going to use your knowledge of statistics to help Mrs. Ross in making a selection.

Tanya: 8 4 20 22 25 14 23 24 2 10 23 25 16 2
Barb: 22 6 12 18 18 12 25 14 13 20 8 20 18 16
Suzette: 30 29 11 16 4 5 20 6 8 22 9 6 28 11 25 9 9

(a) Using technology, construct three box-and-whisker plots.

[Number line from 0 to 30]

(b) Determine the mean score for each player.

(c) Draw three histograms for the three sets of data. Note that the classes will include the first number but not the second. For example the class 0 to 5 includes 0, but not 5.

Class       Tanya Frequency    Barb Frequency    Suzette Frequency
0 to 5      ______             ______            ______
5 to 10     ______             ______            ______
10 to 15    ______             ______            ______
15 to 20    ______             ______            ______
20 to 25    ______             ______            ______
25 to 30    ______             ______            ______
30 to 35    ______             ______            ______

(d) Which player has two distinct clusters within their data? __________________
(e) Who is the best player? __________________
(f) Who is the most consistent player? __________________
(g) What range of scores would be considered Tanya's top 25%? __________________
(h) What range of scores would be considered Barb's bottom 25%? __________________
(i) What range of scores would be considered Suzette's top 50%? __________________

2. Mrs.
Tuttle-Comeau is an assistant coach for her son's high school track and field team. At the last track meet (Track Meet A) she gathered the following data regarding 30 sprinters in the 100 m race. Each of these pieces of data represents the best time each of the high school sprinters obtained during this meet.

11.0 12.2 11.5 12.5 10.6 12.2 12.1 12.8 11.0 11.2 13.0 11.6 12.2 12.2 10.9 12.7 11.2 12.0 11.4 13.2 10.7 13.7 12.2 11.5 10.9 16.2 11.1 12.9 11.9 12.2

(a) Determine the mean time.

(b) Construct a box and whisker plot for this data.

[Number line from 10 to 16]

(c) Construct a histogram. Note that the classes will include the first number but not the second. For example the class 10 to 11 includes 10, but not 11.

Class       Frequency
10 to 11    ______
11 to 12    ______
12 to 13    ______
13 to 14    ______
14 to 15    ______
15 to 16    ______
16 to 17    ______

(d) Are there two distinct clusters within this data? __________________
(e) What range of times would place an individual in the top 50% of the competitors?
(f) What range of times would place an individual in the bottom 25% of the competitors?
(g) What range of times would place an individual in the top 25% of the competitors?
(h) Here's a box-and-whisker plot for another track meet (Track Meet B). Which track meet, A or B, resulted in a greater percentage of strong performances? How did you arrive at this answer?

[Box-and-whisker plot for Track Meet B on a number line from 10 to 16]

3. Body mass index (BMI) is a calculation that uses an individual's height and weight to estimate how much body fat they have. In Canada, BMI is recorded in kg/m² and the result is then matched with one of four categories designated by Health Canada. These categories are:
• underweight (BMIs less than 18.5);
• normal weight (BMIs 18.5 to 24.9);
• overweight (BMIs 25 to 29.9); and
• obese (BMIs 30 and over).

The BMIs for adult learners from two different college classes were calculated and recorded.
Class A: 29.3 27.3 24.3 23.5 27.2 28.6 20.2 24.6 27.3 29.4 21.8 25.2 27.9 28.5 26.8 23.1 28.4 26.9 22.9 28.1 26.7 22.5
Class B: 30.2 23.6 21.4 18.8 17.2 24.2 28.6 19.6 20.9 32.7 26.8 23.8 20.7 18.5 30.8 31.4 21.8 22.5 17.8 18.3

Using technology, construct two box and whisker plots and record the results below.

[Number line from 15 to 30]

How do the BMIs for the two classes compare?

Standard Deviation

Measures of central tendency (mean and median) do not give us any indication of how the data is spread out. Consider the following two sets of data.

First Data Set: 13, 14, 15, 15, 15, 16, 17
Second Data Set: 10, 12, 13, 15, 17, 18, 20

The mean for both of these data sets is 15; however, the individual pieces of data in these sets are considerably different. In the first set, the numbers range from 13 to 17, and clearly cluster around the number 15. In the second set the numbers range from 10 to 20 and tend to be more spread out around the mean. The dispersion is far greater in the second set than in the first.

Standard deviation is one way of measuring dispersion. If the standard deviation is low, then the data clusters around the mean. If the standard deviation is high, then the data is spread out around the mean. Without getting into the actual calculations, the standard deviation for the first data set is 1.20 and the standard deviation for the second data set is 3.30. The larger number indicates greater dispersion.

Calculating Standard Deviation

Before we get to the calculations, we have to remind you of an important point and introduce two formulas. In the unit introduction we stated that this unit would focus on populations, rather than samples. A population is the set representing all measurements of interest to an investigator while a sample is simply a subset of the measurements from the population chosen at random.

We learned that the mean is calculated by adding all the data values and then dividing by the number of data values.
The mean can be expressed using the following formula.

$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

where µ (mu) is the population mean

The formula for population standard deviation, σ (sigma), is shown below. You are not expected to memorize this formula.

$$\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \dots + (x_n - \mu)^2}{n}}$$

This formula requires that you complete six steps.
Step 1: Find the mean; $\mu$.
Step 2: Calculate the difference between each data value and the mean; $x_i - \mu$.
Step 3: Square those differences found in Step 2; $(x_i - \mu)^2$.
Step 4: Add the squared differences; $(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \dots + (x_n - \mu)^2$.
Step 5: Divide the sum from Step 4 by the number of data values.
Step 6: Square root the value from Step 5.

The easiest way to learn how to use this formula (i.e. complete the six steps) is to construct a table where only small portions of the calculation are completed at any one time.

Example 1
Determine the standard deviation for the following set of data.
10, 12, 13, 15, 17, 18, 20

Answer:
Find the mean. (Step 1)

$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{10 + 12 + 13 + 15 + 17 + 18 + 20}{7} = 15$$

Construct the table.

$x_i$    $x_i - \mu$ (Step 2)    $(x_i - \mu)^2$ (Step 3)
10       -5                      25
12       -3                      9
13       -2                      4
15       0                       0
17       2                       4
18       3                       9
20       5                       25
                                 Sum = 76 (Step 4)

$$\sigma = \sqrt{\frac{76}{7}} \approx 3.3$$ (Steps 5 and 6)

The population standard deviation is 3.3.

Example 2
Mrs. Gillis teaches math to adults. At the end of the year she examines the final marks for all of her students who have completed the course. She wants to work out the standard deviation of those marks.

87  72  91  82  74  93  75  83  78  75

Answer:
Find the mean.

$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{87 + 72 + 91 + 82 + 74 + 93 + 75 + 83 + 78 + 75}{10} = 81$$

Construct the table.

$x_i$    $x_i - \mu$    $(x_i - \mu)^2$
87       6              36
72       -9             81
91       10             100
82       1              1
74       -7             49
93       12             144
75       -6             36
83       2              4
78       -3             9
75       -6             36
                        Sum = 496

$$\sigma = \sqrt{\frac{496}{10}} \approx 7.04$$

The population standard deviation is 7.04.

Questions

1.
Determine the standard deviation for the following data.
25  32  24  28  31  28

µ =

$x_i$    $x_i - \mu$    $(x_i - \mu)^2$

2. Determine the standard deviation for the following data.
3.7  4.3  5.0  4.6  4.0  4.7  3.9  4.2

µ =

$x_i$    $x_i - \mu$    $(x_i - \mu)^2$

3. Two data sets have been provided.
15  14  13  18  16  13  16  15  15
17  15  16  14  11  19  16  11  16

(a) Calculate the standard deviation for each data set.

µ =

$x_i$    $x_i - \mu$    $(x_i - \mu)^2$

µ =

$x_i$    $x_i - \mu$    $(x_i - \mu)^2$

(b) The standard deviations are different for the two data sets. What is this telling you?

4. Barb, a math instructor, recorded the height in centimetres of all of the male students in her Level IV math courses. She obtained the following measurements.
181  173  184  183  190  180  186  176  185

(a) What is the median for this data?
(b) What is the mean for this data?
(c) Is Barb dealing with a categorical or numerical data set?
(d) Determine the standard deviation.

$x_i$

(e) Another instructor at a different campus also has 9 male learners in his Level IV Math courses. He measured their heights. He found the mean to be 182 cm with a standard deviation of 6.4 cm. Based on these results, what can you say about the heights of this instructor’s male learners compared to Barb’s male learners?

(f) A third instructor at another campus also has 9 male learners in her Level IV Math courses. She measured their heights. She found the mean to be 179 cm with a standard deviation of 4.8 cm. Based on these results, what can you say about the heights of this instructor’s male learners compared to Barb’s male learners?

5. Without attempting any calculations, match each standard deviation with the appropriate histogram. Please note that all of the histograms are drawn at the same scale.

Standard Deviations:
(a) 0.69   Matches with _____
(b) 1.40   Matches with _____
(c) 3.34   Matches with _____
(d) 3.62   Matches with _____

Histograms:
[Histograms, beginning with (i)]

6.
Create two data sets that meet all of the following conditions.
• They have at least six pieces of data.
• They must have a mean of 10.
• They have standard deviations that are quite different.

Using Technology to Calculate Population Standard Deviation

In the last section we learned how to work out the population standard deviation (σ) using paper and pencil. The TI graphing calculators can calculate this along with several other measures we have been exposed to in this unit. Using such technology is particularly useful when we are dealing with a large number of data points.

Example
Tylena was teaching an evening class comprised of 30 adult learners. She asked them all to complete a series of thirty basic math problems. She recorded how long it took, in minutes, for each learner to complete the task. The data is shown below.

40  60  46  56  68  44  51  53  42  58
55  60  48  45  52  52  38  55  49  46
56  51  50  40  35  50  54  64  50  45

(a) Draw a histogram using technology. Use class widths of 5 starting at 35.
(b) Determine the mean time.
(c) Determine the standard deviation.
(d) Determine the median.

Answers:
Step 1: Enter the Data in the Calculator
STAT > Edit > Enter the data in L1
If data already exists in L1, then move the cursor up so L1 is highlighted, press CLEAR, and move the cursor back down.

Step 2: Draw the Histogram
STATPLOT > Select Plot 1 > Turn on the plot, select histogram, Xlist should be L1 and Freq should be 1.
> WINDOW > Set Xmin at 35, Xmax at 70, Xscl at 5, Ymin at 0, Ymax at 10, Yscl at 1
> GRAPH > TRACE > Use the right and left arrows

Note: The Xmin on the Window setting is the starting point for the first class and the Xscl sets the class width. In this case the first class is 35 - 40.
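The class frequencies that the histogram in Step 2 displays can also be tallied directly. The Python sketch below is not part of the calculator procedure; it is offered only as a cross-check. It follows the convention stated earlier in this unit that a class includes its first number but not its second, and it assumes the calculator groups the data the same way. The function name `class_frequencies` and its parameters are made up for this illustration.

```python
times = [40, 60, 46, 56, 68, 44, 51, 53, 42, 58,
         55, 60, 48, 45, 52, 52, 38, 55, 49, 46,
         56, 51, 50, 40, 35, 50, 54, 64, 50, 45]

def class_frequencies(data, start, width, stop):
    """Count the data values in each class from start up to stop."""
    frequencies = {}
    lower = start
    while lower < stop:
        upper = lower + width
        # A class includes its lower boundary but not its upper boundary.
        frequencies[(lower, upper)] = sum(1 for x in data if lower <= x < upper)
        lower = upper
    return frequencies

# Same settings as the WINDOW step: Xmin = 35, Xmax = 70, Xscl = 5.
freqs = class_frequencies(times, start=35, width=5, stop=70)
for (lower, upper), count in freqs.items():
    print(f"{lower} to {upper}: {count}")
```

The tallest class should stay below the Ymax of 10 chosen in the WINDOW settings, and the frequencies should add up to the 30 learners.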
STAT > CALC > 1-Var Stats > Enter the List (typically L1) > ENTER

The calculator does not report the population mean (µ); however, as we previously learned, the formulas for the sample mean and the population mean are the same. The calculator reports the sample mean x̄, but we know that we are actually dealing with a population mean of 50.4 minutes.

We are also asked to determine the standard deviation, which is actually the population standard deviation (σ). This calculator uses the symbol σx, rather than σ, to represent the population standard deviation. Therefore our population standard deviation is 7.5 minutes.

To find the median, scroll down using the down arrow while still on the 1-Var Stats results until you find Med. The median in this case is 50.5.

(b) population mean (µ) = 50.4 minutes
(c) population standard deviation (σ) = 7.5 minutes
(d) median = 50.5 minutes

Questions

1. Provincial governments keep records of the number of young offenders who are incarcerated each year. The incarceration rates vary greatly from province to province. In 2006 Nova Scotia reported an incarceration rate of 9.91. That means that 9.91 young persons out of every 10 000 young persons were incarcerated. Below you will find the incarceration rates for the provinces and territories for 2006. (Source: Statistics Canada)

Province   Rate
YT         8.57
NT         46.12
NU         20.49
BC         4.45
AB         7.18
SK         24.54
MB         21.25
ON         7.51
QC         3.89
NB         10.20
PE         7.21
NS         9.91
NL         11.93

(a) Are we dealing with a population or a sample? Explain.

(b) Using technology, draw a histogram showing the distribution of incarceration rates. Use class widths of 5 starting at 0.

(c) Determine the mean, median, and standard deviation.

(d) There is a substantial difference between the mean and median. Why is this so?

2. Below you will find a list of Prime Ministers of Canada since Confederation in 1867.
We have also been supplied with their age upon first taking office as PM.

Prime Minister (PM)            First Term Starts   Age
John A. MacDonald              1867                52
Alexander Mackenzie            1873                51
John Abbott                    1891                70
John Sparrow Thompson          1892                48
Mackenzie Bowell               1894                70
Charles Tupper                 1896                74
Wilfrid Laurier                1896                54
Robert Borden                  1911                57
Arthur Meighen                 1920                46
William Lyon Mackenzie King    1921                47
Richard Bennett                1930                60
Louis St-Laurent               1948                66
John Diefenbaker               1957                61
Lester Pearson                 1963                65
Pierre Trudeau                 1968                48
Joe Clark                      1979                39
John Turner                    1984                55
Brian Mulroney                 1984                45
Kim Campbell                   1993                46
Jean Chretien                  1993                59
Paul Martin                    2003                65
Stephen Harper                 2006                46

(a) Are we dealing with a population or a sample? Explain.

(b) Using technology, draw a histogram showing the distribution of ages for PMs first taking office. Use class widths of 5 starting at 35.

(c) Determine the mean PM age for first taking office.

(d) Determine the standard deviation.

(e) Determine the median.

(f) What can you conclude based on the histogram and standard deviation?

3. Cholesterol is a waxy, fat-like substance found in all cells of the body. Our bodies need it to make hormones, vitamin D, and substances used in digestion. However, cholesterol, specifically low density lipoprotein (LDL) cholesterol, in high amounts is dangerous to one's health. The following chart looks at various cholesterol ranges and their classifications. The units of measure are millimoles per litre (mmol/L).

LDL Cholesterol Levels   Classification
below 2.6                desirable
from 2.6 to 3.3          near optimal
from 3.4 to 4.1          borderline
from 4.2 to 4.9          high
above 4.9                too high

Dr. Gillis is looking through the records for all her male patients over the last year who are between the ages of 50 and 60 years. They have all had blood work and she records all the LDL cholesterol levels for these patients in the chart below.
4.1  5.2  3.6  2.9  3.4  2.7  5.1  5.3  2.4  2.6
2.5  2.8  2.5  3.0  3.5  4.6  3.8  4.9  4.8  3.3
4.4  3.2  2.4  3.0  2.3  3.7  4.2  3.7  3.3  3.4

(a) Using technology, draw a histogram showing the distribution of LDL cholesterol levels. Use class widths of 0.8 starting at 1.8.

(b) Determine the mean LDL cholesterol level for Dr. Gillis' male patients between the ages of 50 and 60 years.

(c) Determine the standard deviation.

(d) Determine the median.

(e) What can you conclude based on the histogram and standard deviation?

Distributions

A frequency polygon is the shape that is formed when midpoints of the tops of the bars on a histogram are joined by straight lines. In this case, the frequency polygon forms a bell-shaped curve that is associated with a population that follows a normal distribution. Many variables observed in nature, including heights, weights, and reaction times, follow normal distributions.

Consider the heights of female students at college. There are a few women who are less than 5 feet tall, a few who are taller than 6 feet, but the majority of the women are probably between 5’3” and 5’8”. We would expect a normal distribution for the heights of women attending college.

Let’s consider a population that results in a normal distribution. The normal curve will be centered about the population mean (µ). The standard deviation (σ) determines the extent to which the curve spreads out.

If we look at the two normal distributions supplied below, we can see that both distributions are centered around the same value, 65. That means that the mean for both of these populations is 65. The standard deviations, although not supplied, are not the same. The standard deviation for normal distribution A must be lower than that for distribution B because curve A is narrower, meaning that the data points are more clustered around the mean.

[Figure: two normal curves, labeled A and B, both centered at 65]

Please note that the horizontal axis is labeled x.
This indicates that we are looking at the distribution of the individual data points denoted by the symbol x.

Do not assume that we have to have a perfectly symmetrical bell-shaped distribution to have a normal distribution. The histogram on the right would create a frequency polygon which is almost symmetrical, but we would still say that we are dealing with a normal distribution.

For this course, most of our time will be spent examining situations that follow normal distributions. However, it is important to understand that other types of distributions exist. These other types are shown below. A uniform distribution occurs when every class has equal frequency. A skewed distribution occurs when one tail is much larger than the other tail. A bimodal distribution occurs when two classes with the largest frequencies are separated by at least one class.

[Figures: Uniform Distribution, Skewed Left Distribution, Skewed Right Distribution, Bimodal Distribution]

Question

1. Based on the situation, what type of distribution (normal, uniform, bimodal,…) would you likely obtain? Record the distribution type for each.

(a) You randomly select 100 students at an elementary school and each must report their grade level. There are two classes at each grade level and between 22 and 26 students in each class. What would the distribution of grade levels look like?

(b) Two groups of athletes are running the 100 m dash. One group is comprised of males 12 years of age or younger, and the other is comprised of males between 16 and 20 years of age. You randomly select 150 athletes and ask them to report their time for the 100 m dash. What would the distribution of times look like?

(c) Mrs. Chopra teaches one of the three grade six classes. Normally the administration tries to distribute the strongest math students evenly between the three classes. That did not occur this year and now Mrs. Chopra has a large portion of strong math students in her class.
If her class was asked to complete a fair math test, what would the distribution of marks look like?

(d) You randomly select 100 females between the ages of 20 and 29 and record their heights. What would the distribution of heights look like?

(e) A college instructor had what he described as an average class of students. From his perspective there were a few weak students, a few strong students, but the majority of the students were of average ability. He gave the class an extremely challenging test where only the strongest students could maintain good marks, ranging from 75% to 95%. The rest of the students did poorly, and many resoundingly failed the test. What would the distribution of marks for this test look like?

(f) You spin the following spinner 300 times, recording how many times you obtain each of the results (1, 2, 3, 4). What would the distribution of results look like?

[Figure: spinner with sections labeled 1, 2, 3, and 4]

(g) A nursing student working at the children's hospital looks at the birth weights of all babies born in the hospital during June, July, and August. What would the distribution of birth weights look like?

(h) Eastern American Toads, common in Nova Scotia, enter the world as small dark polliwogs, become miniature toads, and finally mature to be adult toads. What would the distribution of ages for Eastern American Toads of all forms (polliwogs to adults) look like?

(i) A personal trainer at a coed gym recorded the maximum resistance people would set on a particular piece of exercise equipment over a one month period. What would the distribution of resistance settings look like?

(j) A kinesiologist is recording the grip strength of 250 randomly selected males between the ages of 25 and 35. What would the distribution of grip strengths look like?

Normal Distributions and the 68-95-99.7 Rule

In the last section we learned about symmetrical bell-shaped distributions called normal distributions.
We also mentioned that the normal curve will be centered about the population mean (µ), and that the standard deviation (σ) determines the extent to which the curve spreads out. Lower standard deviations result in taller, narrower curves. There is something else that is important to learn about the normal curve. It is the 68-95-99.7 rule.

According to the 68-95-99.7 rule, in any bell-shaped distribution, the following holds true.
• Approximately 68% of the data points will lie within one standard deviation of the mean.
• Approximately 95% of the data points will lie within two standard deviations of the mean.
• Approximately 99.7% of the data points will lie within three standard deviations of the mean.

Let's describe this rule again using the proper symbols that we use for populations.

According to the 68-95-99.7 rule, in any bell-shaped distribution of a population, the following holds true.
• Approximately 68% of the data points are between µ − σ and µ + σ.
• Approximately 95% of the data points are between µ − 2σ and µ + 2σ.
• Approximately 99.7% of the data points are between µ − 3σ and µ + 3σ.

Let’s see how this rule applies to a population with a normal distribution where the population mean (µ) is 40 and the population standard deviation (σ) is 10. This distribution is shown below. Notice that it is centered about the mean.

[Figure: normal curve centered at 40]

For this population we would expect that approximately 68% of the data points would be between 30 (µ − σ or 40 − 10) and 50 (µ + σ or 40 + 10). We would expect that approximately 95% of the data points would be between 20 (µ − 2σ) and 60 (µ + 2σ). Finally, we would expect that approximately 99.7% of the data points would be between 10 (µ − 3σ) and 70 (µ + 3σ).

Let's take what we just learned and expand upon it. Consider the following statements for a normal population.
• If 68% of the data points are found between µ − σ and µ + σ, then 34% of the data points would be between µ and µ + σ.
• If 68% of the data points are found between µ − σ and µ + σ, then 34% of the data points would be between µ − σ and µ.

[Figure: normal curve with 34% of the area between µ − σ and µ, 34% between µ and µ + σ, and 68% in total]

If we extend this line of thinking, we can state the following.
• If 95% of the data points are found between µ − 2σ and µ + 2σ, then 47.5% of the data points would be between µ and µ + 2σ.
• If 95% of the data points are found between µ − 2σ and µ + 2σ, then 47.5% of the data points would be between µ − 2σ and µ.
• If 99.7% of the data points are found between µ − 3σ and µ + 3σ, then 49.85% of the data points would be between µ and µ + 3σ.
• If 99.7% of the data points are found between µ − 3σ and µ + 3σ, then 49.85% of the data points would be between µ − 3σ and µ.

Hopefully it makes sense that 50% of the data points should be above the mean, and 50% of the data points must be below the mean.

It should also be noted that these values (68%, 95%, 99.7%, 34%, 47.5%,…) can be expressed as probabilities. Probability is the chance that something will happen - how likely it is that some event will occur. Referring back to our normal distribution, there is a 0.68 probability that a randomly selected data point can be found within one standard deviation of the mean (i.e. from µ − σ to µ + σ).

Example 1
For a normal population with a mean of 15 and standard deviation of 2, what percentage of the data points would measure
(a) between 15 and 19?
(b) between 13 and 21?
(c) between 11 and 13?

Answers:
(a) This question could be restated. It would read, “What percentage of the data points would be between µ and µ + 2σ?” (Reason: 15 is µ, and 19 is 2σ to the right of µ)

[Figure: normal curve with 47.5% of the area shaded between 15 (µ) and 19 (µ + 2σ)]

Therefore approximately 47.5% of the data points will be between 15 and 19.

(b) This question could be restated.
It would read, “What percentage of the data points would be between µ − σ and µ + 3σ?”

[Figure: normal curve with 34% shaded between 13 (µ − σ) and 15 (µ), and 49.85% shaded between 15 (µ) and 21 (µ + 3σ)]

Therefore approximately 83.85% (34% + 49.85%) of the data points will be between 13 and 21.

(c) This question could be restated. It would read, “What percentage of the data points would be between µ − 2σ and µ − σ?”

[Figure: normal curve marking 11 (µ − 2σ), 13 (µ − σ), and 15 (µ), with areas of 47.5% and 34% measured from the mean]

Therefore approximately 13.5% (47.5% − 34%) of the data points will be between 11 and 13.

Example 2
The quality control officer at a cereal factory knows that the mean weight for the cereal in their regular size box is 461 grams with a standard deviation of 6 grams.
(a) What is the probability of randomly choosing a cereal box off the assembly line that weighs between 461 grams and 467 grams?
(b) What is the probability of randomly choosing a cereal box off the assembly line that weighs between 455 grams and 479 grams?
(c) What is the probability of randomly choosing a cereal box off the assembly line that weighs between 443 grams and 449 grams?
(d) What is the probability of randomly choosing a cereal box off the assembly line that weighs more than 455 grams?
(e) If we randomly chose 800 boxes, how many would we expect to be between 449 grams and 473 grams?

Answers:
(a) Attack this logically.
• We were told that µ is 461, and that σ is 6.
• We were told that we are dealing with boxes between 461 and 467 grams. Notice that 467 is 6 (or one standard deviation) away from 461 (µ). That means that 467 is actually µ + σ.
• Let's find the percentage of data points that are between µ and µ + σ. The answer is 34%.
• Now convert that percentage to a probability. The probability is 0.34.

(b) Think logically.
• 455 is one standard deviation to the left of the mean, and therefore can be expressed as µ − σ.
• 479 is three standard deviations to the right of the mean, and therefore can be expressed as µ + 3σ.
• We actually need to find the percentage of boxes that are between µ − σ and µ + 3σ.
• We know that 34% of the data points are between µ − σ and µ. We also know that 49.85% of the data points are between µ and µ + 3σ. Therefore we can conclude that 83.85% (34% + 49.85%) of the data points are between µ − σ and µ + 3σ.
• Convert 83.85% to a probability of 0.8385. Based on this number, we can say that there is a very high chance that a randomly selected cereal box will have a weight between 455 g and 479 g.

(c) Think logically.
• 443 is three standard deviations to the left of the mean, and therefore can be expressed as µ − 3σ.
• 449 is two standard deviations to the left of the mean, and therefore can be expressed as µ − 2σ.
• We actually need to find the percentage of boxes that are between µ − 3σ and µ − 2σ.
• We know that 49.85% of the data points are between µ − 3σ and µ. We also know that 47.5% of the data points are between µ − 2σ and µ. Therefore we can conclude that 2.35% (49.85% − 47.5%) of the data points are between µ − 3σ and µ − 2σ.
• Convert 2.35% to a probability of 0.0235. Based on this number, we can say that there is a very slight chance that a randomly selected cereal box will have a weight between 443 g and 449 g.

(d) Think logically.
• 34% of the data points are between 455 (µ − σ) and 461 (µ).
• 50% of the data points are greater than 461 (µ).
• Therefore 84% of the data is greater than 455. This gives us a probability of 0.84.

(e) The number 449 is µ − 2σ. The number 473 is µ + 2σ. We know that 95% of the data points should be within two standard deviations of the mean. As a probability, it is expressed as 0.95.
0.95 × 800 = 760
Of the 800 randomly selected cereal boxes, we would expect 760 boxes to be between 449 g and 473 g.

Questions

1.
Use the 68-95-99.7 rule on a distribution of data points with a population mean of 230 and a population standard deviation of 15 to answer the following questions. You may wish to draw and label a normal distribution curve to assist you with each of these questions. This is what we did in Example 1.

(a) What percentage of the data points would measure between 215 and 245?
(b) What percentage of the data points would measure between 230 and 260?
(c) What percentage of the data points would measure between 215 and 230?
(d) What percentage of the data points would measure between 185 and 230?
(e) What percentage of the data points would measure between 200 and 245?
(f) What percentage of the data points would measure between 215 and 275?
(g) What is the probability that a randomly selected data point would be between 185 and 260?
(h) What is the probability that a randomly selected data point would be between 245 and 260?
(i) What is the probability that a randomly selected data point would be between 185 and 200?
(j) What is the probability that a randomly selected data point would be between 245 and 275?
(k) What is the probability that a randomly selected data point would be less than 245?
(l) What is the probability that a randomly selected data point is greater than 200?
(m) What is the probability that a randomly selected data point is less than 215?

2. A company monitored the production of 2000 bagels for a one day period. They determined that the mean weight (population mean) of the bagels was 104 grams with a standard deviation of 3 grams. Assume the distribution of bagel weights is bell-shaped. You may choose to draw and label a normal distribution curve to assist you with each of these questions.

(a) How many of the 2000 bagels were within 9 grams of the mean?
(b) How many of the 2000 bagels were within 3 grams of the mean?
(c) How many of the 2000 bagels are between 98 grams and 104 grams?
(d) How many of the 2000 bagels are between 101 grams and 110 grams?
(e) How many of the 2000 bagels are between 107 grams and 110 grams?
(f) How many of the 2000 bagels are between 98 grams and 110 grams?
(g) How many of the 2000 bagels are between 95 grams and 101 grams?
(h) How many of the 2000 bagels are between 98 grams and 113 grams?
(i) How many of the 2000 bagels are between 95 grams and 104 grams?
(j) How many of the 2000 bagels are between 110 grams and 113 grams?
(k) How many of the 2000 bagels are less than 98 grams?

Z-Score

In the last section, the problems used numbers that were always 1, 2, or 3 standard deviations from the mean. For example, in question 1 (e), we were told that the population mean was 230 and the population standard deviation was 15, and then we were asked to find the percentage of the data points that were between 200 and 245. The number 200 is exactly two standard deviations below the mean, while the number 245 is exactly one standard deviation above the mean. What if we were asked to find the percentage of data points that would be between 197 and 251? These two values are not 1, 2, or 3 standard deviations from the mean; rather, they are located some fractional number of standard deviations away from the mean. Because of this, the technique that we learned in the previous section will not work. We need another approach; we are going to use z-scores.

In statistics, the z-score (also called the standard score) indicates how many standard deviations a data point is above or below the mean. It is found using the following formula.

$$z = \frac{x - \mu}{\sigma}$$

where x is the data point (also called an observation or raw value), µ is the population mean, and σ is the population standard deviation.

Example 1
A population, which results in a bell-shaped distribution, has a mean of 26.1 and standard deviation of 2.3. How many standard deviations from the mean is each of these data points?
(a) 28.9
(b) 24.7

Answers:
(a)
$$z = \frac{x - \mu}{\sigma} = \frac{28.9 - 26.1}{2.3} = 1.22$$
The data point 28.9 is 1.22 standard deviations from the mean of 26.1. The z-score is positive because the data point is larger than the mean (i.e. to the right of the mean).

(b)
$$z = \frac{x - \mu}{\sigma} = \frac{24.7 - 26.1}{2.3} = -0.61$$
The data point 24.7 is 0.61 standard deviations from the mean of 26.1. The z-score is negative because the data point is smaller than the mean (i.e. to the left of the mean).

What we have just learned regarding z-scores is not yet enough to answer questions like the one introduced at the beginning of this section.

Original Question: A population, which results in a bell-shaped distribution, has a mean of 230 and standard deviation of 15. What percentage of data points would be between 197 and 251?

Using the z-score we can now determine how many standard deviations the data points 197 and 251 are away from the mean, 230. This, however, does not tell us the percentage of data points that are between 197 and 251. We need to learn about area under the standard normal curve.

The mathematics necessary to understand how one determines the area under the standard normal curve is well beyond the scope of this course. At this level all we need to know is that the standard normal curve is centered at 0 (i.e. has a mean of 0), has a standard deviation of 1, that the total area under this curve is equal to 1, and that the area under the curve over an interval is equal to the probability that a randomly selected data point falls within that interval. We use the standard normal curve to understand other populations that are normally distributed, even though these populations have different means and standard deviations.

Standard Normal Curve: µ = 0, σ = 1, Area Under the Complete Curve = 1

If we look at the standard normal curve on the right, we notice that we have gone 2 standard deviations to the left and right of the mean (represented by the -2.0 and 2.0).
The area under the curve within this interval (i.e. the shaded region on the diagram) is 0.9544. This area is equivalent to the probability that a randomly selected data point falls within that interval. This makes sense when we remember that we had already learned that there is approximately a 95% chance that a randomly selected data point is within two standard deviations of the mean.

If we look at the next diagram, we have gone 1.2 standard deviations to the left of the mean and 1.6 standard deviations to the right of the mean on the standard normal curve. In this case, the area under the curve in that interval is 0.8301. That means that there is a 0.8301 probability that a randomly selected data point will fall within that interval.

[Figures: two standard normal curves, one with a shaded region labeled Area = 0.9544 and one with a shaded region labeled Area = 0.8301]

In the last two diagrams, we supplied the areas under the curves in the defined intervals, but how do we determine these areas when they are not supplied? We have to use a chart and a procedure that is similar to what we used in the last section. The chart allows us to determine the area/probability between the mean and a value that is a specified number of standard deviations away from the mean. The easiest way to show how to use the chart is through worked examples.

Example 2
A population, which results in a bell-shaped distribution, has a mean of 250 and standard deviation of 30. What is the probability that a measurement from a randomly selected item is between 250 and 272?

Answer:
Start by considering the interval from 250 to 272. The 250 is equivalent to the population mean (µ). The 272 is 22 units to the right of the mean; we need to determine how many standard deviations this value (272) is away from the mean. This is when we use z-scores.

$$z = \frac{x - \mu}{\sigma} = \frac{272 - 250}{30} = 0.73$$

We can now rephrase the original question. We are really trying to find the probability that a randomly selected data point is between µ and µ + 0.73σ.

Now let's put this in the context of our standard normal curve, which is drawn on the right.
Remember, on our standard normal curve, the mean is 0 and the standard deviation is 1. We are going to find the area under this curve from 0 (µ) to 0.73 (µ + 0.73σ). The area under this curve in this interval has been shaded on our diagram.

We can use our knowledge of the standard normal curve to understand other populations that are normally distributed, even though these populations have different means and standard deviations. The area under our standard normal curve from 0 to 0.73 is equivalent to the area under our original normal distribution from µ (250) to µ + 0.73σ (272).

To find the area under our standard normal curve, we go to the Areas Under the Standard Normal Curve chart found in the back of this resource (page 96). We have reproduced a portion of this chart below. We work with the row labeled 0.7 and the column labeled 0.03 (Reason: 0.7 + 0.03 = 0.73). We find that this row and column intersect at 0.2673.

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
...
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
...

That means that the area under the standard normal curve between 0 (µ) and 0.73 (µ + 0.73σ) is 0.2673. In terms of our original normal distribution, it means that there is a 0.2673 probability that a randomly selected data point will be between 250 (µ) and 272 (µ + 0.73σ).

Example 3
Data for a population was normally distributed with a mean of 167 and standard deviation of 18. What is the probability that a randomly selected data point from this population is between 144 and 181?
Answer:
This question is more challenging than the last one because neither of the values supplied (144 or 181) is the population mean. The lower limit, 144, is below the mean, while the upper limit, 181, is above the mean. We need to find out how far below and above the mean these two values are, in terms of standard deviations. That means we need to work out the z-scores.

For 144:
z = (x − µ) / σ
z = (144 − 167) / 18
z = −1.28

For 181:
z = (x − µ) / σ
z = (181 − 167) / 18
z = 0.78

Our question can now be rephrased as "What is the probability that a randomly selected data point from this population is between µ − 1.28σ and µ + 0.78σ ?" To tackle this, we need to work with the standard normal curve and have to break the question into parts. We start by finding the area/probability on our standard normal curve from -1.28 ( µ − 1.28σ ) to 0 ( µ ), then find the area/probability from 0 ( µ ) to 0.78 ( µ + 0.78σ ), and finally we add the two areas/probabilities.

Area/Probability between µ − 1.28σ and µ (Find 1.28 on the chart.)

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
...
1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015

Area/Probability between µ and µ + 0.78σ (Find 0.78 on the chart.)

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
...
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852

0.3997 + 0.2823 = 0.6820

For the standard normal curve, the area from -1.28 to 0.78 is 0.6820.
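The two chart look-ups and the addition above can be checked with a short Python sketch. It assumes the standard relationship between the normal curve and the error function (math.erf in Python's standard library) rather than the printed chart, and the helper name area_from_mean is hypothetical, chosen only for this illustration.

```python
import math

def area_from_mean(z):
    """Area under the standard normal curve between 0 and z (what the chart lists)."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

mean, std_dev = 167, 18
z_low = round((144 - mean) / std_dev, 2)   # -1.28
z_high = round((181 - mean) / std_dev, 2)  # 0.78

# The limits are on either side of the mean, so the two areas are added.
total = area_from_mean(z_low) + area_from_mean(z_high)
print(f"{total:.4f}")  # 0.6820
```

The z-scores are rounded to two decimal places here only to mirror the by-hand method; without that rounding the result would differ slightly in the later decimal places.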
In terms of our original normal distribution, there is a 0.6820 probability that a randomly selected data point from this population is between 144 ( µ − 1.28σ ) and 181 ( µ + 0.78σ ).

The Different Cases
The biggest struggle with these questions is the determination of the areas since the chart only shows areas from 0 ( µ ) to the specified z value. There are five different cases we may encounter, two of which we have already examined in Examples 2 and 3.

Case 1
This occurs when we need to find the area/probability between a given z value and 0 ( µ ). With these questions we simply use the chart once. This is what we did in Example 2.

Case 2
This occurs when we need to find the area/probability between two given z values that are on either side of 0 ( µ ). With these questions, we find two separate areas/probabilities and add them together. This is what we did in Example 3.

Case 3
This occurs when we need to find the area/probability between two given z values that are on the same side of 0 ( µ ). With these questions, we find two separate areas/probabilities and subtract the smaller from the larger.

Case 4
This occurs when we need to find the area/probability to the right of a positive z value, or to the left of a negative z value. With these questions, we take the area to the right (or left) of 0 (this area is always equal to 0.5 because it is half the area under our standard normal curve) and subtract the area from 0 to the z value.

Case 5
This occurs when we need to find the area/probability to the right of a negative z value, or to the left of a positive z value. With these questions, we take the area to the right (or left) of 0 (this area is always equal to 0.5 because it is half the area under our standard normal curve) and add the area from 0 to the z value.

Example 4
Porphyrin is a pigment in blood protoplasm.
In the population of healthy adults, the concentration of porphyrin is normally distributed with mean µ = 38 mg/dL and standard deviation σ = 12 mg/dL.
(a) What is the probability that a randomly selected healthy adult would have a porphyrin concentration between 43 mg/dL and 54 mg/dL?
(b) What is the probability that a randomly selected healthy adult would have a porphyrin concentration less than 47 mg/dL?

Answers:
(a) Both 43 and 54 are above the mean (38). We need to find out how far above the mean these two values are, in terms of standard deviations. That means we need to determine the z-scores.

For 43:
z = (x − µ) / σ
z = (43 − 38) / 12
z = 0.42

For 54:
z = (x − µ) / σ
z = (54 − 38) / 12
z = 1.33

Based on this work the question can be rephrased. "What is the probability that a randomly selected healthy adult would have a porphyrin concentration between µ + 0.42σ and µ + 1.33σ ?" Now let's put this in the context of our standard normal curve. We need to find the area under the curve (which is equivalent to the probability) from 0.42 to 1.33. Notice that both of these values are to the right of 0 ( µ on our standard normal curve). That means that we are dealing with Case 3.
• Find the area/probability from 0 to 0.42. From the chart we find that the answer is 0.1628.
• Find the area/probability from 0 to 1.33. From the chart we find that the answer is 0.4082.
• Now subtract the two areas/probabilities. 0.4082 - 0.1628 = 0.2454
There is a 0.2454 probability that a randomly selected healthy adult would have a porphyrin concentration between 43 mg/dL and 54 mg/dL.

(b) We start by finding how far 47 is above the mean (38) in terms of standard deviations.

z = (x − µ) / σ
z = (47 − 38) / 12
z = 0.75

The question can now be rephrased. "What is the probability that a randomly selected healthy adult would have a porphyrin concentration less than µ + 0.75σ ?" Let's put this in the context of our standard normal curve.
We need to find the area under the curve (which is equivalent to the probability) below 0.75. Notice we are trying to find the area under the curve to the left of a positive z value; this is Case 5.
• Find the area/probability less than 0. It is always 0.5 because we are dealing with exactly half of our standard normal curve.
• Find the area/probability from 0 to 0.75. From the chart we find that the answer is 0.2734.
• Now add the two areas/probabilities. 0.5 + 0.2734 = 0.7734
There is a 0.7734 probability that a randomly selected healthy adult would have a porphyrin concentration less than 47 mg/dL.

Checking Your Answers on the TI-83 or TI-84 (Optional)
The normalcdf command (normal cumulative distribution function command) allows one to determine the probability that a data point will fall within an interval for a known normal distribution. This command is found using the DISTR button.

normalcdf(lower limit, upper limit, mean, standard deviation)

In part (a) of Example 4, we wanted to find the probability that a randomly selected healthy adult would have a porphyrin concentration between 43 mg/dL and 54 mg/dL. To do this we enter normalcdf(43, 54, 38, 12) into the calculator. It generates the probability 0.2472. This is very close to the 0.2454 we worked out by hand. The calculator actually produced a more accurate answer because we had to round off our z-scores to two decimal places when working things out by hand.

For questions where there is only one endpoint, it is recommended that one go 5 (or more) standard deviations above or below the mean. This happened in part (b) of Example 4 where we had to find the probability that a randomly selected healthy adult would have a porphyrin concentration less than 47 mg/dL. Five standard deviations below the mean is -22 (38 - 5 × 12). We would enter normalcdf(-22, 47, 38, 12) into the calculator. It generates the probability 0.7734.

Questions
1.
A population, which results in a bell-shaped distribution, has a mean of 42.7 and a standard deviation of 7.9. How many standard deviations from the mean is each of these data points?
(a) 37.6
(b) 53.2

2. It may surprise you but professors at universities do not spend all their time teaching graduate and undergraduate students. A significant amount of time is spent on research. So what percentage of time do professors spend teaching and on teaching-related activities? The NEA Almanac of Higher Education reported that the mean percentage of time spent on teaching activities is about 51% with a standard deviation of 25%. If we are dealing with a bell-shaped distribution, determine the z-scores corresponding to the following professors' percentage of time devoted to teaching activities.
(a) Dr. B. Pletner, 68%
(b) Dr. R. Dawson, 43%

3. An NSCC instructor examined the results from a common exam offered at all campuses. She discovered that the marks were normally distributed. She calculated the z-scores for her six learners. These are shown below.
Tylena, 0.93; Hamid, -1.13; Meera, -0.42; Beverly, 0.00; Elliott, 1.27; Marcus, 0.58
(a) Which of these learners scored above the mean?
(b) Which of these learners scored below the mean?
(c) Which of these learners scored exactly at the mean?
(d) Which of her learners obtained the best mark? Based on the information provided, can you determine the mark?
(e) Can you tell if every one of her learners passed the test? Explain.

4. The concentration of red blood cells in whole blood is measured in millions per cubic millimetre. Within the population of healthy females, the red blood cell concentration is normally distributed with a mean of 4.8 million/mm³ and a standard deviation of 0.3 million/mm³. (Hint: Each of these five questions corresponds to one of the five cases we described earlier for area under the standard normal curve.
You may wish to draw the standard normal curve as was done in the worked examples to assist you with each part of this question.)
(a) What is the probability that a randomly selected healthy female would have a red blood cell concentration between 4.8 and 5.3 million/mm³?
(b) What is the probability that a randomly selected healthy female would have a red blood cell concentration between 4.4 and 5.0 million/mm³?
(c) What is the probability that a randomly selected healthy female would have a red blood cell concentration between 5.2 and 5.5 million/mm³?
(d) What is the probability that a randomly selected healthy female would have a red blood cell concentration less than 4.6 million/mm³?
(e) What is the probability that a randomly selected healthy female would have a red blood cell concentration greater than 4.3 million/mm³?

5. A community examined the response times of their police department over a three year period. They discovered that the distribution of response times was bell-shaped and that the mean response time was 8.2 minutes with a standard deviation of 1.9 minutes. For a randomly received emergency call to the police department in that three year period, what is the likelihood that the response time will be:
(a) greater than 8.2 minutes?
(b) between 6.0 and 8.2 minutes?
(c) less than 9.3 minutes?
(d) between 6.4 and 7.7 minutes?
(e) between 4.2 and 8.8 minutes?
(f) greater than 9.7 minutes?

6. A consumer magazine reports that the average life of a refrigerator before replacement is 14 years with a standard deviation of 2.5 years. Assume that the distribution of refrigerator life spans is approximately normal. What is the probability that someone will keep a refrigerator:
(a) between 11 years and 16 years?
(b) greater than 15 years?
(c) less than 14 years?
(d) between 10 years and 13 years?
(e) greater than 12 years?
(f) between 8 years and 14 years?
Growth Charts
One of the most common uses of standard deviations is in the production of growth charts used in the health sciences. These charts show the wide range of values for a particular measurement (e.g. weight, height, head circumference,…) for different ages. Normally we would use standard deviation to describe the spread of these measurements, but many growth charts use percentiles. Although the charts use percentiles, it is important to note that standard deviations were used in the construction of these percentiles. Each standard deviation represents a fixed percentile. For example, −3 σ is the 0.13th percentile, −2 σ the 2.28th percentile, −1 σ the 15.87th percentile, 0 σ the 50th percentile, +1 σ the 84.13th percentile, +2 σ the 97.72nd percentile, and +3 σ the 99.87th percentile. You are not expected to know these values. Growth charts don't use percentiles like 0.13, 2.28 or 15.87; rather, they use whole numbers like 3, 5, 10, 25, and so on.

Source: Wikimedia Commons, Author: Mwtoews

Percentiles rank the position of an individual by indicating what percent of the reference population the individual would equal or exceed. For example, on the weight growth charts, a 30-month-old boy whose weight is at the 25th percentile weighs the same as or more than 25 percent of the reference population of 30-month-old boys, and weighs less than 75 percent of the 30-month-old boys in the reference population.

It is important to understand that the growth charts are best used to follow a child's growth over time or to find a pattern of his/her growth. Should one be concerned if a child consistently is in a low percentile for a particular measure? For example, should a parent be concerned if from the ages of 10 months to 32 months their girl ranks between the 5th and 10th percentile for weight? The answer is no; she is exhibiting normal growth. Should one be concerned with a sudden drop or sudden increase in a percentile value for a particular measure?
For example, should a parent be concerned if their son dropped from the 90th percentile for weight at the age of 6 months to the 25th percentile at the age of 12 months? The answer is yes; such a large drop may indicate a problem.

On the growth charts we will be using, there are nine lines/curves. The bottom line represents the 3rd percentile and the top line represents the 97th percentile. The other lines from bottom to top are the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentile.

We have included these growth charts in the appendix, found at the end of this resource. We will need to use these charts to answer all the questions in this section. All of these charts are from the 2000 CDC Growth Charts for the United States: Methods and Development (Kuczmarski RJ, Ogden CL, Guo SS, et al. 2000 CDC growth charts for the United States: Methods and development. National Center for Health Statistics. Vital Health Stat 11(246). 2002). We should apologize ahead of time that we have only supplied growth charts for boys. The growth charts for boys are blue and those for girls are pink. Unfortunately charts in pink do not reproduce well in a black and white resource so we had to omit them.

Example 1
Using the weight growth chart for boys, answer the following.
(a) In what percentile is a 3 month old boy weighing 12 pounds (or 5.44 kg)? What does this percentile mean?
(b) What weight would one expect for a four month old boy who is in the 75th percentile for weight?
(c) What range of weights would one expect for two month old boys who are between the 3rd and 97th percentile for weight?
(d) What range of ages would one expect for boys whose weights are 12 pounds yet stay within the 3rd and 97th percentile for their age?

Answers:
(a) On the vertical axis, find 12 pounds and on the horizontal axis, find 3 months. Plot the point (3, 12) on the coordinate system. This point lies on the fourth curve from the bottom (i.e.
the 25th percentile curve). It means that this 3 month old, 12 pound boy weighs as much as or more than 25 percent of the boys of the same age.
(b) On the horizontal axis, find 4 months. Move up until we intersect the sixth curve from the bottom (i.e. the 75th percentile curve). This point corresponds with a weight of 16 pounds (or approximately 7.23 kg).
(c) A two month old boy in the 3rd percentile would only weigh approximately 8.8 pounds. A two month old boy in the 97th percentile weighs approximately 14.5 pounds. Therefore we would expect that weights between 8.8 pounds and 14.5 pounds would cover all two month old boys between the 3rd and 97th percentile.
(d) A one month old boy could weigh as much as 12 pounds if he is in the 97th percentile. A boy a little more than 4 months old could weigh as little as 12 pounds if he is in the 3rd percentile. Therefore, boys between 1 month and a little more than 4 months of age could weigh 12 pounds yet still be within the 3rd and 97th percentile for their age.

Questions
1. In what percentile for head circumference is a 12 month old boy with a head circumference of 46.2 cm? Explain what this percentile means.

2. In what percentile for length is a 31 month old boy with a length of 99 cm (or 39 inches)? Explain what this percentile means.

3. For each case, determine the percentile ranking.
(a) 33 month old boy, length = 36 inches
(b) 21 month old boy, weight = 31 pounds
(c) 30 month old boy, weight = 26 pounds
(d) 23 month old boy, head circumference = 19.5 inches
(e) 10 month old boy, length = 28.5 inches
(f) 33 month old boy, head circumference = 19.75 inches (or approximately 51 cm)
(g) 10 month old boy, weight = 24.5 pounds (or approximately 11.3 kg)
(h) 28 month old boy, length = 33.5 inches (or approximately 86 cm)

4. For each case, determine the measure.
(a) What weight would one expect for a twelve month old boy who is in the 5th percentile for weight?
(b) What length would one expect for a 20 month old boy who is in the 50th percentile for length?
(c) What head circumference would one expect for a 10 month old boy who is in the 97th percentile for head circumference?

5. What range of lengths would one expect for 15 month old boys who are between the 3rd and 97th percentile for length?

6. What range of head circumferences would one expect for 30 month old boys who are between the 3rd and 97th percentile for head circumference?

7. What range of ages would one expect for boys whose lengths are 31 inches yet stay within the 3rd and 97th percentile for their age?

8. What range of ages would one expect for boys whose head circumferences are 16.25 inches yet stay within the 3rd and 97th percentile for their age?

9. What range of weights would one expect for 33 month old boys who are between the 25th and 75th percentile for weight?

10. What range of lengths would one expect for 22 month old boys who are between the 10th and 90th percentile for length?

11. What range of ages would one expect for boys whose weights are 21 pounds yet stay within the 5th and 90th percentile for their age?

12. What range of ages would one expect for boys whose lengths are 29 inches yet stay within the 25th and 75th percentile for their age?

13. Look at the weights of a particular boy over a 12 month period. Do you have concerns regarding his weight? Explain.

Months        0     2     4     6     8     10    12
Weight (kg)   4.55  5.89  6.80  7.58  7.82  8.16  8.42

Putting It Together
In this unit we looked at the following.
• Populations and Samples
• Categorical and Numerical Data
• Bar Graphs, Double Bar Graphs, Stacked Bar Graphs, Histograms, Circle Graphs and Line Graphs
• Mean, Trimmed Mean, Median, and Mode
• Box and Whisker Plots (with and without technology)
• Standard Deviation (with and without technology)
• Distributions (Normal, Skewed, Bimodal, Uniform)
• The 68-95-99.7 Rule for Normal Distributions
• Z-Scores
• Growth Charts

Questions:
1. The manager of the community sportsplex wanted to know how the 1386 members might feel about the discussion concerning an addition to the existing building that included a 25 metre, 8 lane pool. He asked 230 randomly selected members if they were willing to pay an additional $35 a year on their membership fee to have these new features. Describe the population and the sample for this situation.

2. For each of the following, state whether the data collection would result in a categorical data set or numerical data set. If the data is numerical, indicate whether we are dealing with discrete or continuous data.
(a) The number of pets in Nova Scotian households
(b) The type of MP3 player owned by adults
(c) The diameter of the trunk of spruce trees growing in a particular valley
(d) The size of T-shirts worn by boys between the ages of 16 and 18 years
(e) The number of children traveling more than 1.5 kilometres to school
(f) The time to complete a driver’s license renewal at a specific Access Nova Scotia location

3. The 5-year survival rates for six different types of cancers have been supplied in the graph below.

[Graph: Survival Rate % (vertical axis, 0 to 100) for Brain, Ovary, Colorectal, Breast, Skin Melanoma, and Prostate cancers, with one series for 1992 to 1994 and one for 2004 to 2006]
Source: Canadian Cancer Registry

(a) What was the approximate survival rate for colorectal cancer between 1992 and 1994?
(b) What was the approximate survival rate for breast cancer between 2004 and 2006?
(c) By approximately how much did the survival rate for ovarian cancer improve from 1992-1994 to 2004-2006?
(d) If approximately 22 200 Canadian women were diagnosed with breast cancer in 2006, then how many are expected to survive?
(e) What type of graph (bar, double bar, stacked bar, circle,…) are we dealing with here?
(f) Can you conclude that there were fewer cases of brain cancer than prostate cancer based on this graph? Why or why not?

4. A major fast food chain that specializes in pizzas had all its stores report on the toppings selected by all customers for their pizzas. This data was used to construct the circle graph below. It is also important to know that this chain sold 564 000 pizzas over a one year period amongst all of their establishments.

[Circle graph: other 6%, onions 4%, mushroom 14%, pepperoni 42%, sausage 19%, vegetable 15%]

(a) Are we dealing with a sample or a population? Explain.
(b) What percentage of customers ordered vegetables on their pizza?
(c) What percentage of customers ordered sausage and/or onion on their pizzas?
(d) What percentage of customers ordered sausage and onion on their pizzas?
(e) How many pizzas with pepperoni topping were sold during this year?
(f) How many pizzas with sausage and/or mushroom toppings were sold during this year?
(g) What is the ratio of pizzas with mushroom toppings to pizzas with pepperoni toppings?
(h) There were 107 160 pizzas with a particular topping. What topping was it?

5. The following graph shows the number of infant deaths in Canada from 1999 to 2007.

[Line graph: Number of Infant Deaths (vertical axis, 1,720 to 1,900 in steps of 20) versus Year (horizontal axis, 1998 to 2008)]
Source: Statistics Canada

What are your thoughts regarding the scale used on the vertical axis of this line graph?

6. Below you have been provided with data tables.
Indicate what type of graph (histogram, line, circle, bar, double bar, or stacked bar graph) you would use for this data.

(a) Graph Type: ___________________

Brand of Car    Canadian Market Share (Sept 2011)
Toyota          9.9%
GM              12.8%
Honda           8.1%
Ford            16.7%
Chrysler        15.8%
Volkswagen      4.8%
Hyundai         13.1%
Other           18.8%

(b) Graph Type: ___________________

Canadian Police-reported Crimes    2008      2009
Impaired Driving                   84 759    88 630
Abduction                          464       429
Arson                              13 270    13 372
Counterfeiting                     1015      798
Theft over $5000                   16 743    15 573
Fraud                              90 932    90 623
Uttering Threats                   78 500    78 407
Extortion                          1385      1701

(c) Graph Type: ___________________

Cause for Lateness       Frequency
Snoozing after Alarm     83
Car Problems             23
Missed Public Transit    47
Family Crisis            62
Stuck in Traffic         113
Other                    59

(d) Graph Type: ___________________

Mean Amount of Sleep in Hours    Number of People
5 - 6                            26
6 - 7                            74
7 - 8                            103
8 - 9                            57
9 - 10                           21

(e) Graph Type: ___________________

Time    Height of Projectile in Metres
0       2.0
1       22.1
2       32.4
3       32.9
4       23.6
5       4.5

(f) Graph Type: ___________________

Department     Jan Profit ($)    Feb Profit ($)    Mar Profit ($)
Automotive     4045              5612              6289
Toys           2045              2549              3283
Electronics    6845              2248              1867
Sporting G.    2567              1217              1506
Footwear       4753              5608              6099
Men's          1598              2286              1894
Women's        3725              4589              4635

7. An airline company randomly selected eighteen suitcases from domestic flights and recorded their weights in kilograms.
16.2 11.3 15.7 14.7 15.1 19.6 16.0 14.1 3.9 18.0 14.8 16.3 13.6 11.9 12.4 14.8 13.5 19.7
(a) Although the airline collected a sample, describe the population in this situation.
(b) Would a histogram or bar graph be used with this data set?
(c) Calculate the mean, median, mode, and 5% trimmed mean without using the STAT feature on a TI-83/84 calculator.

8. Mr. Tetford's and Mrs. Gatien's learners wrote the same math test. The test was out of 30. The results for the two classes are shown below.
Mr. Tetford's Class 26 26 29 22 23 19 25 27 23 27 24 20 25
Mrs.
Gatien's Class 25 27 23 21 23 22 20 24 20 30 21 24 20 22
(a) Construct box and whisker plots for each set of data without using a graphing calculator.
[Number line for the plots: 5 to 30 in steps of 5]
(b) What range of marks would place a learner in the top 50% of Mr. Tetford's class?
(c) What range of marks would place a learner in the bottom 25% of Mrs. Gatien's class?
(d) What range of marks would place a learner in the top 25% of Mrs. Gatien's class?
(e) How do the two classes compare in terms of marks on this math test?

9. A study looked at the concentration of iron in the bloodstream of ten randomly selected high performance female athletes. The following data was collected. The concentrations are measured in grams per decilitre (g/dl).
15.3 14.2 13.6 11.9 14.8 12.6 14.6 13.9 14.2 12.9
(a) Are we dealing with a population or a sample?
(b) Calculate the mean without using the STAT features on your calculator. Use the appropriate symbol.
(c) Calculate the standard deviation without using the STAT features on your calculator.

10. If you were collecting a random sample in each situation, what type of distribution (normal, uniform, bimodal, skewed) would you likely obtain?
Distribution Type
(a) Hodgkin’s lymphoma is a type of cancer that originates from white blood cells. This disease typically affects people either in early adulthood or when they are 55 years of age or older. You randomly select 250 patients with Hodgkin’s lymphoma and ask them to report the age of their initial diagnosis. What would the distribution of ages likely look like?
(b) Most people make under $40,000 a year, but some make quite a bit more, with a smaller number making many millions of dollars a year. What would the distribution of yearly earnings likely look like?
(c) James is working as a biologist for the summer and measuring the circumferences of randomly selected maple trees in a natural growth forest. What would the distribution of circumferences likely look like?
(d) You use the random number generator on your calculator to find 500 random whole numbers between 1 and 10. What would the distribution of numbers likely look like?

11. The body mass indexes of all 6000 new recruits to the armed forces were measured. The mean was 23.0 kg/m² and the standard deviation 2.5 kg/m². Assume that the distribution of body mass indexes was bell-shaped. (Hint: Use the 68-95-99.7% rule to solve these questions, rather than z-scores and the standard normal curve.)
(a) How many new recruits had body mass indexes between 23.0 kg/m² and 25.5 kg/m²?
(b) How many new recruits had body mass indexes between 18.0 kg/m² and 23.0 kg/m²?
(c) How many new recruits had body mass indexes between 15.5 kg/m² and 30.5 kg/m²?
(d) How many new recruits had body mass indexes between 20.5 kg/m² and 28.0 kg/m²?
(e) How many new recruits had body mass indexes between 18.0 kg/m² and 30.5 kg/m²?
(f) How many new recruits had body mass indexes between 15.5 kg/m² and 25.5 kg/m²?
(g) How many new recruits had body mass indexes between 25.5 kg/m² and 28.0 kg/m²?
(h) How many new recruits had body mass indexes between 15.5 kg/m² and 18.0 kg/m²?
(i) How many new recruits had body mass indexes greater than 23.0 kg/m²?
(j) How many new recruits had body mass indexes greater than 20.5 kg/m²?
(k) How many new recruits had body mass indexes less than 28.0 kg/m²?
(l) How many new recruits had body mass indexes greater than 25.5 kg/m²?
(m) How many new recruits had body mass indexes less than 18.0 kg/m²?

12. Data collected over the last 100 years indicates that the average daily temperature for a particular location in August is 26°C with a standard deviation of 3°C. If we are dealing with a bell-shaped distribution, determine the z-scores corresponding to each of these temperatures.
(a) 31°C
(b) 24°C

13. Scores on the Wechsler Adult Intelligence Scale (i.e.
an IQ test) for 20 to 34 year old adults are approximately normal with a mean of 110 and a standard deviation of 25. For a randomly selected adult within that age group, determine (without using a graphing calculator) the likelihood that their IQ will be:
(a) between 104 and 128?
(b) between 80 and 110?
(c) greater than 110?
(d) less than 132?
(e) between 90 and 107?
(f) greater than 150?

14. In what percentile for head circumference is an 11 month old boy with a head circumference of 44.4 cm? Explain what this percentile means.

15. What weight would one expect for a 24 month old boy who is in the 25th percentile for weight?

16. What range of lengths would one expect for 28 month old boys who are between the 3rd and 97th percentile for length?

17. What range of ages would one expect for boys whose lengths are 25 inches yet stay within the 3rd and 97th percentile for their age?

18. What range of head circumferences would one expect for 25 month old boys who are between the 10th and 90th percentile for head circumference?

Areas Under the Normal Curve (z-Table)
The values inside the table represent the areas under the normal curve for values between 0 and a z-score. For example, to determine the area under the curve between 0 and 1.37, look in the intersecting cell for the row labeled 1.3 and the column labeled 0.07. The area is 0.4147.
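For readers who would like to see where the table values come from, individual entries can be reproduced with a short Python sketch. It assumes the standard link between the normal curve and the error function (math.erf from Python's standard library); the function name table_area is hypothetical and used only for this illustration.

```python
import math

def table_area(z):
    """Area under the standard normal curve between 0 and a non-negative z."""
    return 0.5 * math.erf(z / math.sqrt(2))

# Row 1.3, column 0.07 (z = 1.37)
print(f"{table_area(1.37):.4f}")  # 0.4147
# Row 0.7, column 0.03 (z = 0.73), the value used in Example 2
print(f"{table_area(0.73):.4f}")  # 0.2673
```

Because the normal curve is symmetric, the area between 0 and a negative z-score is found by looking up the positive value instead.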
z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
2.1   0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
2.2   0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
2.3   0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
2.4   0.4918  0.4920  0.4922  0.4925  0.4927  0.4929  0.4931  0.4932  0.4934  0.4936
2.5   0.4938  0.4940  0.4941  0.4943  0.4945  0.4946  0.4948  0.4949  0.4951  0.4952
2.6   0.4953  0.4955  0.4956  0.4957  0.4959  0.4960  0.4961  0.4962  0.4963  0.4964
2.7   0.4965  0.4966  0.4967  0.4968  0.4969  0.4970  0.4971  0.4972  0.4973  0.4974
2.8   0.4974  0.4975  0.4976  0.4977  0.4977  0.4978  0.4979  0.4979  0.4980  0.4981
2.9   0.4981  0.4982  0.4982  0.4983  0.4984  0.4984  0.4985  0.4985  0.4986  0.4986
3.0   0.4987  0.4987  0.4987  0.4988  0.4988  0.4989  0.4989  0.4989  0.4990  0.4990

Post-Unit Reflections
What is the most valuable or important thing you learned in this unit?
What part did you find most interesting or enjoyable?
What was the most challenging part, and how did you respond to this challenge?
How did you feel about this topic when you started this unit?
How do you feel about this topic now?
Of the skills you used in this unit, which is your strongest skill?
What skill(s) do you feel you need to improve, and how will you improve them?
How does what you learned in this unit fit with your personal goals?

Answers

Populations and Samples (pages 1 to 2)
1. Population: all the taxpayers in this community (4127)
Sample: the 300 randomly selected taxpayers
2. Population: all the used bricks that the contractor purchased (6000)
Sample: the 200 randomly selected bricks that were examined to determine usability
3. Population: all of the employed workers in Nova Scotia (453 000)
Sample: the 1200 randomly selected employed workers who participated in the survey and reported their annual gross income
4. Population: all of the adults who received a high school diploma from NSSAL between 2001 and 2009
Sample: the 240 randomly selected NSSAL graduates who participated in the interview

Tables (pages 3 to 4)
1. Star Wars: Episode 0
2. Star Wars: Episode 0
3. Terminator: Rise of the Toasters
4. Jaws: The Teething Years
5. Transformers: The Horse and Buggy Years
6. A graph of some fashion
7.
It is far easier to use this graph to answer the questions on the previous page.
8. Population

Types of Data (pages 5 to 6)
1. (a) numerical (continuous)
   (b) categorical
   (c) categorical
   (d) numerical (continuous)
   (e) numerical (discrete)
   (f) numerical (continuous)
   (g) categorical
   (h) numerical (continuous)
   (i) numerical (discrete)
   (j) numerical (continuous)
   (k) numerical (discrete)
   (l) categorical
   (m) numerical (continuous)
   (n) categorical

Bar Graphs and Histograms (pages 7 to 14)
1. (a) baseball
   (b) approximately 78 million fans
   (c) football
   (d) little less than 20 million fans
   (e) bar graph
2. (a) double bar graph
   (b) Germany
   (c) 37 medals
   (d) Norway
   (e) 2 medals
   (f) 7 medals
   (g) 116 medals
   (h) 121 medals
3. (a) histogram
   (b) numerical, continuous
   (c) approximately 52 000 RNs (24 000 + 28 000)
   (d) approximately 18 000 RNs (36 000 - 18 000)
   (e) three classes: 45 to 49 years, 50 to 54 years, and 55 to 59 years
   (f) shortage of RNs in the future
4. (a) stacked bar graph
   (b) no
   (c) little more than 1300 cases
   (d) approximately 550 cases
   (e) approximately 300 cases (850 - 550)
   (f) consult visits 2005/2006: 460 (540 - 80)
       consult visits 2006/2007: 660 (750 - 90)
       660 - 460 = 200 cases
   (g) inpatient days decreased significantly but consult visits increased by a similar amount
5. (a)
   (b) 16 2/3 %
   (c) numerical, continuous
   (d) sample

Circle Graphs and Line Graphs (pages 15 to 19)
1. (a) (b) (c) (d) automobile accidents 3 times 288 60 12 (e) (ii) 7 (f) home injuries
2. (a) Jan - Feb 08, Aug - Sept 08, Jan - Feb 09, Jan - Feb 10, Oct - Nov 10
   (b) Oct 08
   (c) May 09
   (d) $15 000 million ($15 billion)
3. (a) 40%
   (b) 7/12
   (c) 242 starts
   (d) 340
4. (a) 13th day, $7.40
   (b) $11.40 per share
   (c) 15th day, $2.50 per share

First Impressions/Second Impressions (pages 20 to 23)
(More detailed responses are required than what is supplied below.)
Part 1 - The perspective of the circle graph that was initially presented can lead one to believe that the three brands of ice cream are favored equally; this is not the case.
Part 2 - One may initially assume that the population of Trois-Rivieres is 4 to 5 times that of Lethbridge if one does not consider the scale on the vertical axis. On the first bar graph, the vertical axis starts at 50 000, rather than 0 (as it does on the second graph).
Part 3 - Because the first graph deals with percentages, we do know what percentage of patrons for each ride were male and female. However, we are unable to see how the rides compared to each other in terms of attracting patrons. This only became possible when we examined the second graph, which plotted the number of people on the vertical axis.
Part 4 - The first graph may have made individuals believe that the average price of a domestic airfare was fluctuating wildly. This occurs when one fails to look at the scale on the vertical axis. In the first graph, the scale starts at $160, rather than $0 (as it does in the second graph).

What Type of Graph Should Be Used? (pages 24 to 25)
1. Double Bar Graph (or Stacked Bar Graph)
2. Circle Graph (or Bar Graph)
3. Line Graph
4. Histogram
5. Stacked Bar Graph
6. Circle Graph (or Bar Graph)
7. Bar Graph
8. Double Bar Graph
9. Histogram
10. Line Graph

Mean, Median, Mode, and Trimmed Mean (pages 26 to 33)
1. (a) sample
   (b) x̄ = 6.2, Median = 6, Mode = 7
   (c) There are no outliers.
2. (a) population
   (b) numerical
   (c) µ = 159.44, Median = 157, No Mode
3. (a) sample
   (b) x̄ = 35 (34.6), Median = 31, Mode = 23 and 27 (bimodal)
       5% Trimmed Mean: x̄(T) = 31 (30.6)
       10% Trimmed Mean: x̄(T) = 31 (30.9)
   (c) Trimmed means are appropriate because the outlier 115 exists within the data set.
   (d) Four data points from the bottom and four data points from the top of the data set
4.
(a) x̄ = 268 (267.875), Median = 254 (253.5), Mode = 267, x̄(T) = 255 (255.409)
   (b) Median and Trimmed Mean
   (c) Histogram
5. This scoring system was likely implemented to eliminate the effect of a single rogue judge who would inflate or deflate the score of a particular athlete.

Box and Whisker Plots (pages 34 to 40)
1. (a) minimum: 6, lower quartile: 11, median: 17, upper quartile: 21, maximum: 30
   (b) minimum: 33, lower quartile: 40, median: 44, upper quartile: 48, maximum: 52
   (c) minimum: 24, lower quartile: 25.5, median: 30, upper quartile: 35, maximum: 40
   (d) minimum: 28, lower quartile: 35, median: 36.5, upper quartile: 38, maximum: 41
2. (a) minimum: 7, lower quartile: 10.5, median: 18, upper quartile: 20.5, maximum: 22
   (b) The median, upper quartile and maximum for Mr. Porter's class are equal to those for Mr. Churchill's class. That means that in both classes the students with slower reaction times (i.e. worse than the median) were performing at approximately the same level. When we compare students with faster reaction times (i.e. better than the median), however, we notice a difference between the two classes. Because Mr. Churchill's class has a smaller minimum and lower quartile, we can say that his students with faster reaction times in general out-performed Mr. Porter's students with faster reaction times.
   (c) Mrs. Lowe's Class: minimum: 6, lower quartile: 10, median: 14, upper quartile: 18, maximum: 20
       Mr. Vroom's Class: minimum: 6, lower quartile: 15, median: 18, upper quartile: 23, maximum: 23
       With the exception of the minimum, all other values are lower (faster reaction times) for Mrs. Lowe's class. That means that the majority of Mrs. Lowe's students out-performed Mr. Vroom's students in the reaction time experiment.
   (d) Mrs. Burchill's Class: minimum: 5, lower quartile: 10, median: 12.5, upper quartile: 16, maximum: 21
       Mr. Rhodenizer's Class: minimum: 6, lower quartile: 9, median: 13, upper quartile: 16, maximum: 22
       The two box-and-whisker plots are very similar.
One can conclude that the students performed at about the same level on the reaction time experiment.

Using Technology to Make Box-and-Whisker Plots (pages 41 to 45)
1. (a) Tanya: minimum: 2, lower quartile: 8, median: 20, upper quartile: 24, maximum: 25
       Barb: minimum: 6, lower quartile: 12, median: 17, upper quartile: 20, maximum: 25
       Suzette: minimum: 4, lower quartile: 7, median: 10, upper quartile: 21, maximum: 30
   (b) Tanya's Mean: 16.2, Barb's Mean: 15.9, Suzette's Mean: 13.9
   (c) Class       Frequency (Tanya)   Frequency (Barb)   Frequency (Suzette)
       0 to 5      3                   0                  1
       5 to 10     1                   2                  7
       10 to 15    2                   4                  2
       15 to 20    1                   4                  1
       20 to 25    5                   3                  2
       25 to 30    3                   1                  2
       30 to 35    0                   0                  1
   (d) Tanya
   (e) Tanya
   (f) Barb
   (g) 24 to 25 points
   (h) 6 to 12 points
   (i) 10 to 30 points
2. (a) Mean Time: 12.0
   (b) minimum: 10.6, lower quartile: 11.2, median: 12.05, upper quartile: 12.5, maximum: 16.2
   (c) Class       Frequency
       10 to 11    4
       11 to 12    10
       12 to 13    12
       13 to 14    3
       14 to 15    0
       15 to 16    0
       16 to 17    1
   (d) no
   (e) 10.6 to 12.05 seconds
   (f) 12.5 to 16.2 seconds
   (g) 10.6 to 11.2 seconds
   (h) Track Meet A
3. Class A: minimum: 20.2, lower quartile: 23.5, median: 26.85, upper quartile: 28.1, maximum: 29.4
   Class B: minimum: 17.2, lower quartile: 19.2, median: 22.15, upper quartile: 27.7, maximum: 32.7
   Although the median for Class B is much lower (and in the normal range), we have far more extremes in this class. There are a significant number in Class B that are underweight or obese; that is why the box and whiskers are so much larger when plotting this class's BMI data. For Class A the data is more clustered together, with all individuals falling within the normal and overweight ranges, although more than half are in the overweight category.

Standard Deviation (pages 46 to 50)
1. σ = 2.89
2. σ = 0.41
3.
(a) σ = 1.49 and σ = 2.49
   (b) The standard deviation is lower for the first data set. That means this data is not as spread out as the data in the second data set.
4. (a) (b) (c) (d) (e) 183 182 numerical data set σ = 4.90 The average heights of these two groups of learners are the same; however, the standard deviation for Barb's group is much lower. That means that there is less variation in heights among Barb's male learners compared to the other instructor's learners. The heights of her learners are more clustered around the mean.
   (f) The standard deviations are almost the same for the two groups of male learners; however, the mean height for Barb's group is higher. We can conclude that the average height of male learners in Barb's math courses is three centimeters more than that of the third instructor's male students. The variation in heights within the two groups is essentially the same.
5. Histogram (i) matches with (c).
   Histogram (ii) matches with (b).
   Histogram (iii) matches with (d).
   Histogram (iv) matches with (a).
6. Answers will vary.

Using Technology to Calculate Population Standard Deviation (pages 52 to 56)
1. (a) population
   (b)
   (c) µ = 14.1, median: 9.91, σ = 11.2 (Units: young persons out of 10 000 young persons)
   (d) The mean is high because the incarceration rate for the Northwest Territories is so much higher than the other rates.
2. (a) population
   (b)
   (c) µ = 55.6 years
   (d) σ = 9.5 years
   (e) median: 54.5 years
   (f) The data does not cluster well around the mean.
3. (a) (b) (c) (d) (e) µ = 3.6 mmol/L σ = 0.90 mmol/L median: 3.4 mmol/L Most of the patients are clustered in the near optimal and borderline ranges. There are a few who are in the desirable range, and even a few more in the high and too high ranges.

Distributions (pages 57 to 59)
1.
(a) uniform
   (b) bimodal
   (c) skewed right
   (d) normal
   (e) skewed left
   (f) uniform
   (g) normal
   (h) skewed left
   (i) bimodal
   (j) normal

Normal Distributions and the 68-95-99.7 Rule (pages 60 to 67)
1.      Hint                              Calculation        Answer
   (a)  Between µ − σ and µ + σ           -                  68%
   (b)  Between µ and µ + 2σ              -                  47.5%
   (c)  Between µ − σ and µ               -                  34%
   (d)  Between µ − 3σ and µ              -                  49.85%
   (e)  Between µ − 2σ and µ + σ          47.5% + 34%        81.5%
   (f)  Between µ − σ and µ + 3σ          34% + 49.85%       83.85%
   (g)  Between µ − 3σ and µ + 2σ         49.85% + 47.5%     0.9735
   (h)  Between µ + σ and µ + 2σ          47.5% - 34%        0.135
   (i)  Between µ − 3σ and µ − 2σ         49.85% - 47.5%     0.0235
   (j)  Between µ + σ and µ + 3σ          49.85% - 34%       0.1585
   (k)  Less than µ + σ                   50% + 34%          0.84
   (l)  Greater than µ − 2σ               47.5% + 50%        0.975
   (m)  Less than µ − σ                   50% - 34%          0.16

2.      Hint                              Calculation        Percentage   Answer
   (a)  Between µ − 3σ and µ + 3σ         -                  99.7%        1994
   (b)  Between µ − σ and µ + σ           -                  68%          1360
   (c)  Between µ − 2σ and µ              -                  47.5%        950
   (d)  Between µ − σ and µ + 2σ          34% + 47.5%        81.5%        1630
   (e)  Between µ + σ and µ + 2σ          47.5% - 34%        13.5%        270
   (f)  Between µ − 2σ and µ + 2σ         -                  95%          1900
   (g)  Between µ − 3σ and µ − σ          49.85% - 34%       15.85%       317
   (h)  Between µ − 2σ and µ + 3σ         47.5% + 49.85%     97.35%       1947
   (i)  Between µ − 3σ and µ              -                  49.85%       997
   (j)  Between µ + 2σ and µ + 3σ         49.85% - 47.5%     2.35%        47
   (k)  Less than µ − 2σ                  50% - 47.5%        2.5%         50

Z-Scores (pages 68 to 79)
1. (a) -0.65
   (b) 1.33
2. (a) 0.68
   (b) -0.32
3. (a) Tylena, Elliott, Marcus
   (b) Meera, Hamid
   (c) Beverly
   (d) Elliott, no
   (e) No, they may have all passed if the mean mark was very high or the majority could have failed if the mean mark was very low. Without the mean and standard deviation we cannot tell who passed and who failed.
4. (a) 0.4525
   (b) 0.4082 + 0.2486 = 0.6568
   (c) 0.4901 - 0.4082 = 0.0819
   (d) 0.5 - 0.2486 = 0.2514
   (e) 0.5 + 0.4525 = 0.9525
5. (a) 0.5
   (b) 0.3770
   (c) 0.5 + 0.2190 = 0.7190
   (d) 0.3289 - 0.1026 = 0.2263
   (e) 0.4826 + 0.1255 = 0.6081
   (f) 0.5 - 0.2852 = 0.2148
6.
(a) 0.3849 + 0.2881 = 0.6730
   (b) 0.5 - 0.1554 = 0.3446
   (c) 0.5
   (d) 0.4452 - 0.1554 = 0.2898
   (e) 0.2881 + 0.5 = 0.7881
   (f) 0.4918

Growth Charts (pages 80 to 84)
1. 50th percentile; The head circumference for this 12 month old boy is equal to or greater than the head circumference of 50% of the boys of the same age.
2. 95th percentile; The length of this 31 month old boy is equal to or greater than the length of 95% of the boys of the same age.
3. (a) 25th percentile
   (b) 90th percentile
   (c) 10th percentile
   (d) 75th percentile
   (e) Between the 25th and 50th percentile
   (f) Between the 50th and 75th percentile
   (g) Between the 90th and the 95th percentile
   (h) Between the 5th and 10th percentile
4. (a) 19 pounds (approximately 8.6 kg)
   (b) 33 inches (approximately 83.7 cm)
   (c) 19 inches (approximately 48.2 cm)
5. 29 inches (approximately 73.6 cm) to 33.5 inches (approximately 85.1 cm)
6. 18.25 inches (approximately 46.3 cm) to 20.5 inches (approximately 52 cm)
7. 10 to 21 months
8. 1 to 6 months
9. 28.5 pounds (approximately 12.9 kg) to 33 pounds (approximately 15 kg)
10. 32 inches (approximately 81.3 cm) to 35.5 inches (approximately 90.2 cm)
11. 6 to 17 months
12. 9 to 12 months
13. (Hint: Change to Percentiles) Should be concerned; the boy went from the 97th percentile for weight at birth to the 3rd percentile for weight by the age of 12 months

Putting It Together (pages 85 to 95)
1. Population: all 1386 members of the sportsplex
   Sample: the 230 randomly selected members
2. (a) Numerical, Discrete
   (b) Categorical
   (c) Numerical, Continuous
   (d) Categorical
   (e) Numerical, Discrete
   (f) Numerical, Continuous
3. (a) 56%
   (b) 87%
   (c) 4%
   (d) 19314 (if you use a survival rate of 87%)
   (e) double bar
   (f) No, the graph does not show the number of cases. It only shows survival rates.
4. (a) population because all stores had to report toppings selected by all customers.
(b) 15%
   (c) 23%
   (d) Cannot determine based on the information supplied.
   (e) 236 880 pizzas
   (f) 186 120 pizzas
   (g) 1/3
   (h) sausage
5. The scale used makes one initially feel that there were drastic fluctuations in the number of infant deaths between 2004 and 2007. This is not the case.
6. (a) circle graph
   (b) double bar graph
   (c) bar graph
   (d) histogram
   (e) line graph
   (f) stacked bar graph
7. (a) Population: All suitcases on domestic flights
   (b) Histogram
   (c) x̄ = 14.5 kg, Median = 14.8 kg, Mode = 14.8 kg, x̄(T) = 14.9 kg
8. (a) Mr. Tetford's Class: Minimum: 19, Lower Quartile: 22.5, Median: 25, Upper Quartile: 26.5, Maximum: 29
       Mrs. Gatien's Class: Minimum: 20, Lower Quartile: 21, Median: 22.5, Upper Quartile: 24, Maximum: 30
   (b) 25 to 29
   (c) 20 to 21
   (d) 24 to 30
   (e) Although the lowest and highest marks in Mrs. Gatien's class are better than those for Mr. Tetford's class, the middle 50% of her learners obtained marks between 21 and 24, while the middle 50% of Mr. Tetford's learners obtained marks between 22.5 and 26.5 (actually between 23 and 26 because half points were not awarded on the test). Mr. Tetford's class outperformed Mrs. Gatien's class on this particular test.
9. (a) sample
   (b) 13.8 g/dl
   (c) 1.01 g/dl
10. (a) Bimodal
    (b) Skewed (left)
    (c) Normal
    (d) Uniform
11. (a) 2040
    (b) 2850
    (c) 5982
    (d) 4890
    (e) 5841
    (f) 5031
    (g) 810
    (h) 141
    (i) 3000
    (j) 5040
    (k) 5850
    (l) 960
    (m) 150
12. (a) 1.67
    (b) -0.67
13. (a) 0.0948 + 0.2642 = 0.3590
    (b) 0.3849
    (c) 0.50
    (d) 0.3106 + 0.5 = 0.8106
    (e) 0.2881 - 0.0478 = 0.2403
    (f) 0.5 - 0.4452 = 0.0548
14. 10th percentile; The head circumference for this 11 month old boy is equal to or greater than the head circumference of 10% of the boys of the same age.
15. 26 pounds (or 11.8 kg)
16. 33 inches to 38.5 inches
17. 2 months to approximately 6.7 months
18. 18.5 inches to almost 20 inches
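The answers to question 13 of Putting It Together (mean 110, standard deviation 25) can be reproduced step by step. The sketch below is an illustration only and not part of the original resource: the helper names table_area, cumulative and between are made up, and the areas come from Python's math.erf rather than the printed z-table. To imitate a table lookup, it rounds each z-score to two decimal places and each area to four.

```python
import math

MEAN, SD = 110, 25  # values stated in question 13

def table_area(z):
    """Area between 0 and |z|, rounded to 4 places like a printed z-table cell."""
    return round(0.5 * math.erf(abs(z) / math.sqrt(2)), 4)

def cumulative(x):
    """P(score < x), using a z-score rounded to 2 places as in a table lookup."""
    z = round((x - MEAN) / SD, 2)
    # Add the table area above the mean, subtract it below the mean.
    return 0.5 + table_area(z) if z >= 0 else 0.5 - table_area(z)

def between(lo, hi):
    """P(lo < score < hi) as a difference of two cumulative areas."""
    return round(cumulative(hi) - cumulative(lo), 4)

answers = {
    "a": between(104, 128),              # z = -0.24 and z = 0.72
    "b": between(80, 110),               # z = -1.20 and z = 0
    "c": round(1 - cumulative(110), 4),  # everything above the mean
    "d": round(cumulative(132), 4),      # z = 0.88
    "e": between(90, 107),               # z = -0.80 and z = -0.12
    "f": round(1 - cumulative(150), 4),  # z = 1.60
}
for part, probability in answers.items():
    print(part, probability)
```

Writing every answer as a difference of cumulative areas covers the "add", "subtract" and "0.5 plus or minus" cases of the table method with one rule.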
