Inferential Statistics Unit (Level IV Academic Math)
Draft
NSSAL
C. David Pilmer
©2010 (Last Updated: Dec 2011)

This resource is the intellectual property of the Adult Education Division of the Nova Scotia Department of Labour and Advanced Education.

The following are permitted to use and reproduce this resource for classroom purposes.
• Nova Scotia instructors delivering the Nova Scotia Adult Learning Program
• Canadian public school teachers delivering public school curriculum
• Canadian nonprofit tuition-free adult basic education programs

The following are not permitted to use or reproduce this resource without the written authorization of the Adult Education Division of the Nova Scotia Department of Labour and Advanced Education.
• Upgrading programs at post-secondary institutions
• Core programs at post-secondary institutions
• Public or private schools outside of Canada
• Basic adult education programs outside of Canada

Individuals, not including teachers or instructors, are permitted to use this resource for their own learning. They are not permitted to make multiple copies of the resource for distribution. Nor are they permitted to use this resource under the direction of a teacher or instructor at a learning institution.

Acknowledgments

The Adult Education Division would like to thank the following university professors for reviewing this resource to ensure all mathematical concepts were presented correctly and in a manner that supported our learners.
Dr. David Hamilton (Dalhousie University)
Dr. Genevieve Boulet (Mount Saint Vincent University)
Dr. Robert Dawson (Saint Mary’s University)

The Adult Education Division would also like to thank the following NSCC instructors for piloting this resource and offering suggestions during its development.
Charles Bailey (IT Campus)
Elliott Churchill (Waterfront Campus)
Barbara Gillis (Burridge Campus)
Barbara Leck (Pictou Campus)
Suzette Lowe (Lunenburg Campus)
Floyd Porter (Strait Area Campus)
Brian Rhodenizer (Kingstec Campus)
Joan Ross (Annapolis Valley Campus)
Jeff Vroom (Truro Campus)

Table of Contents

Introduction
Tracking Your Progress
Negotiated Completion Date
Mathematics Multimedia Learning Objects
The Big Picture
Course Timelines

Introductory Material and Terminology
Bar Graphs and Histograms
Describing Data, Part 1
Describing Data, Part 2
Using Technology
Normal Distribution
Using the 68-95-99.7 Rule
Making Inferences
Collecting a Sample
Sampling Methods
Simulated Sampling
Sampling Distribution of the Sample Means
Central Limit Theorem
Point Estimates and Interval Estimators
Putting It Together

If You Have Time
Post-Unit Reflections
Terms, Symbol, and Formulas
TI-83/84 Statistics Information Sheet
Answers

NSSAL ©2010 Draft C. D. Pilmer

Introduction

Statistics is the discipline concerned with the collection, organization, and analysis of data to draw conclusions or make predictions. Statistics is widely employed in government, business, and the natural and social sciences.
In the first few sections of the unit we will focus on descriptive statistics, the branch of statistics that deals with the description of data. In these sections we will use terms such as mean, median, mode, and standard deviation. The latter sections, which make up the majority of this unit, will focus on inferential statistics, the branch of statistics in which one makes inferences about population characteristics based on evidence drawn from samples. In these sections we will learn about confidence intervals based on a sample mean.

Statistics is used by numerous disciplines (e.g. psychology, education, business, medicine, ecology, anthropology,…). This branch of mathematics impacts directly and indirectly on many aspects of your life. As governments wrestle with social and economic matters, they rely heavily on statistical information so that they can make informed decisions. For this reason, the federal government has a branch solely dedicated to the collection of statistical information. That branch is called Statistics Canada. When pharmaceutical companies are developing new drugs, they use numerous statistical tools to analyze data collected from their nonhuman and human trials. Without these tools they would be unable to assess the benefits and risks associated with the new medication. Companies that manufacture goods use a variety of statistical tools to monitor quality control. Even the coordination of traffic lights is based on the collection and analysis of statistical information. Statistics is truly woven into every aspect of our lives.

Take a few minutes to view the following three-minute online video.
TED: Arthur Benjamin's formula for changing math education
http://www.ted.com/talks/lang/eng/arthur_benjamin_s_formula_for_changing_math_education.html

In this unit, we will not require you to master numerous statistical tools; rather, we will focus on understanding the origins and uses of a few tools.
It is important that we do not work blindly through this material. Although the actual mechanics of using these statistical tools may seem easy, understanding their origins and meanings is far more challenging and is ultimately the purpose of this unit. We need to think about the new concepts that we are exposed to and how they relate to previous concepts.

Tracking Your Progress

This page allows you to keep track of your progress through this material.

Section                                        Date Started    Date Completed
Introductory Material and Terminology          ____________    ______________
Bar Graphs and Histograms                      ____________    ______________
Describing Data, Part 1                        ____________    ______________
Describing Data, Part 2                        ____________    ______________
Using Technology                               ____________    ______________
Normal Distribution                            ____________    ______________
Using the 68-95-99.7 Rule                      ____________    ______________
Collecting a Sample                            ____________    ______________
Sampling Methods                               ____________    ______________
Simulated Sampling                             ____________    ______________
Sampling Distribution of the Sample Means      ____________    ______________
Central Limit Theorem                          ____________    ______________
Point Estimates and Interval Estimators        ____________    ______________
Putting It Together                            ____________    ______________

Negotiated Completion Date

After working for a few days on this unit, sit down with your instructor and negotiate a completion date for this unit.

Start Date: _________________
Completion Date: _________________
Instructor Signature: __________________________
Student Signature: __________________________

Mathematics Multimedia Learning Objects

In this resource you will find references to the online Mathematics Multimedia Learning Objects. These online learner supports can be found at the following website and can be accessed using the following username and password.

http://www.cdli.ca/mlo/tutorials/index.php
Username: camet
Password: camet06
Province: Nova Scotia

Please do not view every learning object at this site. Only use those that are identified in this resource.

The Big Picture

The following flow chart shows the optional bridging unit and the eight required units in Level IV Academic Math. These have been presented in a suggested order.

Bridging Unit (Recommended)
• Solving Equations and Linear Functions

Describing Relations Unit
• Relations, Functions, Domain, Range, Intercepts, Symmetry

Systems of Equations Unit
• 2 by 2 Systems, Plane in 3-Space, 3 by 3 Systems

Trigonometry Unit
• Pythagorean Theorem, Trigonometric Ratios, Law of Sines, Law of Cosines

Sinusoidal Functions Unit
• Periodic Functions, Sinusoidal Functions, Graphing Using Transformations, Determining the Equation, Applications

Quadratic Functions Unit
• Graphing using Transformations, Determining the Equation, Factoring, Solving Quadratic Equations, Vertex Formula, Applications

Rational Expressions and Radicals Unit
• Operations with and Simplification of Radicals and Rational Expressions

Exponential Functions and Logarithms Unit
• Graphing using Transformations, Determining the Equation, Solving Exponential Equations, Laws of Logarithms, Solving Logarithmic Equations, Applications

Inferential Statistics Unit
• Population, Sample, Standard Deviation, Normal Distribution, Central Limit Theorem, Confidence Intervals

Course Timelines

Academic Level IV Math is a two credit course within the Adult Learning Program. As a two credit course, learners are expected to complete 200 hours of course material. Since most ALP math classes meet for 6 hours each week, the course should be completed within 35 weeks. The curriculum developers have worked diligently to ensure that the course can be completed within this time span. Below you will find a chart containing the unit names and suggested completion times. The hours listed are classroom hours. In an academic course, there is an expectation that some work will be completed outside of regular class time.
Unit Name                                      Minimum Completion    Maximum Completion
                                               Time in Hours         Time in Hours
Bridging Unit (optional)                         0                    20
Describing Relations Unit                        6                     8
Systems of Equations Unit                       18                    22
Trigonometry Unit                               18                    20
Sinusoidal Functions Unit                       20                    24
Quadratic Functions Unit                        36                    42
Rational Expressions and Radicals Unit          12                    16
Exponential Functions and Logarithms Unit       20                    24
Inferential Statistics Unit                     20                    24
Total                                          150 hours             200 hours

As one can see, this course covers numerous topics and for this reason may seem daunting. You can complete this course in a timely manner if you manage your time wisely, remain focused, and seek assistance from your instructor when needed.

Introductory Material and Terminology

As we learned in the introduction, descriptive statistics is concerned with the description of data. This means that we look at methods that organize and summarize data in an effective presentation that ultimately increases our understanding of the data. Among the most common tools used in descriptive statistics are pictorial representations such as graphs (e.g. bar graphs, circle graphs, line graphs,…).

Example 1
This line graph shows how the fertility rate in Canada has changed since 1950. The fertility rate is the average number of children born to women between the ages of 15 and 49. This information was obtained from census data collected by Statistics Canada.

[Line graph: Fertility Rate (vertical axis, 0 to 4.5) versus Year (horizontal axis, 1950 to 2000)]

(a) In what year was the highest fertility rate?
(b) In what year was the lowest fertility rate?
(c) Approximate the fertility rate in 1985.
(d) Estimate the drop in fertility rate that occurred between 1965 and 1970.

Answers:
(a) 1960
(b) 2000
(c) 1.7
(d) 3.2 − 2.4 = 0.8

Example 2
This circle graph shows the leading causes of death of American women ages 65 years and older. The graph was constructed using information collected regarding the deaths of 50 000 American women 65 years of age and older.

[Circle graph: heart disease 47%, cancer 26%, stroke 13%, flu 7%, COPD 7%]

(a) How many women died from heart disease?
(b) How many women died from either the flu or a stroke?
(c) How many more women died from heart disease than a stroke?

Answers:
(a) 47% of 50 000: 0.47 × 50 000 = 23 500 women
(b) 7% + 13% = 20%: 0.20 × 50 000 = 10 000 women
(c) 47% − 13% = 34%: 0.34 × 50 000 = 17 000 women

Note: Many statisticians discourage the use of pie charts because most people find it difficult to compare the sizes of different pie slices. These statisticians contend that we would be better served using bar graphs.

The two examples above were provided for a specific reason. These examples allow us to differentiate between a population and a sample.

A population is formally defined as the set representing all measurements of interest to the investigator. In the first example, the investigator, Statistics Canada, wanted to know fertility rates based on the births to every woman between the ages of 15 and 49. That is why they used census data. Every person, including women between 15 and 49, is required by law to complete a census. Based on this, Statistics Canada was confident that their data represented all measurements of interest.

A sample is formally defined as a subset of measurements selected from a population of interest. In the second example, the data was not collected for every death of women 65 years of age or older. The data was from a sample of size 50 000. These 50 000 data points are a subset of the population. It is usually more feasible and less expensive to obtain a sample than to obtain all the measurements from the population.

As we learned in the introduction, inferential statistics is about making inferences about population characteristics based on evidence drawn from samples. In other words, we try to use a sample to understand a population.
We will focus on inferential statistics later in the unit.

If you wish further clarification, go to the Mathematics Multimedia Learning Objects (see page iv), access Unit 11-5 Statistics, and view MLO4 Differentiating Populations and Samples.

Example 3
The Testing and Evaluation Division of the Department of Education reported that the average mark on the grade 12 provincial math exam was 68%. This average was obtained by randomly selecting 500 exams from throughout the province. Are we dealing with a sample or a population? Explain.

Answer:
The Testing and Evaluation Division randomly selected 500 exams, rather than every exam. For this reason they were dealing with a sample (i.e. a subset of the population).

Types of Data

When data is collected from a sample or a population, the responses can be classified as a categorical data set or a numerical data set. These two terms are most easily explained using an example. Suppose we have an adult education class of 10 learners who all have cell phones. The instructor asks two questions and obtains the following responses.

Question 1: What cell phone provider do you use?
Responses to Question 1: {Telus, Bell Aliant, Telus, Bell Aliant, Rogers, Rogers, Koodo, Rogers, Telus, Rogers}

Question 2: What was your cell phone bill for the previous month?
Responses to Question 2: {$27.80, $33.50, $45.70, $32.00, $54.90, $29.00, $43.65, $67.40, $35.89, $39.67}

The collection of responses to the first question is called a categorical data set. Categorical data is data that can be assigned to distinct non-overlapping categories. The responses to question 1 fit into four categories: Bell Aliant, Koodo, Rogers and Telus. The collection of responses to the second question is called a numerical data set. This is the case because the data is made up of numbers, specifically different amounts of money.

There are two types of numerical data: discrete and continuous.
Numerical data is discrete if the possible values are isolated points on a number line. For example, if survey participants were asked how many phone calls they made today, their responses would be whole numbers like 0, 4 or 12. They would not respond with something like 7.8 phone calls. Since they can only report isolated points, we end up with discrete numerical data.

Numerical data is continuous if the set of possible values forms an entire interval on the number line. For example, if soil samples were tested for acidity, the pH could be reported with numbers like 4, 4.17, 4.173, or any other number in the interval.

Generally, continuous data arises when observations involve making measurements (e.g. weighing objects, recording temperatures, recording time to complete tasks,…) while discrete data arises when observations involve counting.

Questions

1. The town’s mayor is interested in knowing what portion of her 4127 taxpayers supports the development of a new recreational center in the community. Because it is too costly to contact all the taxpayers, a survey of 300 randomly selected taxpayers is conducted. Describe the population and sample for this problem.

2. A building contractor just purchased 6000 used bricks. He knows that a small portion of these bricks are cracked and therefore unusable. He randomly selected 200 bricks and discovered that 14 of them were unusable. Describe the population and sample for this problem.

3. A company conducted a phone survey that involved 1200 randomly selected employed workers from Nova Scotia. Each participant had to report their annual gross income. At the time (2009) it was known that there were 453 000 employed workers in Nova Scotia. After conducting the survey and analyzing the data, the company reported an average annual income of $29 900 for the 1200 participants. Describe the population and sample for this problem.

4.
Between 2001 and 2009, 3730 adults obtained high school diplomas through the Nova Scotia School for Adult Learning (NSSAL). The Nova Scotia government wanted to know how many of these adults pursued further education after obtaining their diploma. After interviewing 240 randomly selected graduates, it was discovered that 65% had pursued post secondary education, primarily at the Nova Scotia Community College. Describe the population and sample for this problem.

5. For each of the following, state whether the data collection would result in a categorical data set or numerical data set. If the data is numerical, indicate whether we are dealing with discrete or continuous data.
(a) Concentration in parts per million (ppm) of a particular contaminant in water supplies
(b) Brand of personal computer purchased by customers
(c) The sex of children born at the IWK Hospital in December
(d) The height of male adult education learners at a specific campus
(e) The number of children in each household
(f) The gross income of adult workers between the ages of 25 and 35 in Nova Scotia
(g) The races of people immigrating to Canada
(h) The time it takes for females between the ages of 20 and 30 to complete the 100 m dash
(i) The sum of the numbers rolled on two dice

6. This bar graph shows the average annual snowfall in six major Canadian cities.

[Bar graph: Average Snowfall (cm), vertical axis 0 to 400, by City (Vancouver, Calgary, Regina, Toronto, Quebec City, Halifax). Source: Statistics Canada]

(a) Of the six cities reported, which one has the greatest average annual snowfall? Approximate that average for this city.
(b) Approximately how much more snow does Regina get compared to Vancouver?
(c) When the data was collected prior to creating this bar graph, would the snowfall data be classified as a categorical data set or numerical data set?

7. The municipality wanted to understand how its citizens were commuting to and from work.
It was impractical to ask every citizen this question so they decided to conduct a survey where 1100 randomly selected citizens were asked, “What is your primary form of transportation to and from work?” The data was collected and a circle graph was constructed.

[Circle graph: own vehicle 39%, public transit 28%, carpool 17%, walk 9%, bicycle 7%]

(a) How many people responded carpool?
(b) How many more people responded public transit than bicycle?
(c) How many people responded walk or bicycle?
(d) Are we dealing with a population or a sample? Explain.
(e) Would the collection of responses to this survey question be classified as a categorical or numerical data set?

8. Statistics Canada has been using census data to track employment participation (part-time and full-time) of Canadian females from 1976 to 2006. The graph below was constructed using this data. The participation rates are reported as a percentage.

[Line graph: Participation Rate (vertical axis, 0 to 70) versus Year (horizontal axis, 1976 to 2006)]

(a) Approximate the participation rate in 2006.
(b) Approximately how much did the participation rate increase between 1976 and 1991?
(c) Between what years was there a drop in the participation rate?
(d) Are we dealing with a population or sample? Explain.

Bar Graphs and Histograms

Bar graphs and histograms look very similar, so learners often get them confused.

Bar graphs are used to display categorical data or discrete numerical data. The bars in bar graphs are separated from one another. Examples of bar graphs are shown below.

Bar Graph #1
In this survey, 60 randomly selected Australian students were asked to report in which month they were born.

Bar Graph #2
In this survey, 200 randomly selected international students were asked which hand they write with.

Histograms are used to display continuous numerical data where the data is organized into classes. The bars on a histogram are not separated from one another.
Histogram #1
In this survey, 100 randomly selected students from all over the world were asked to report how long it took to travel from home to school. In this case the class width is 5. The first class goes from 0 to 5, not including 5. The second class goes from 5 to 10, not including 10.

Histogram #2
Forty randomly selected secondary students from Canada were asked to report their heights in centimeters. As with Histogram #1, the class width in this case is 5; however, the intervals do not start and end on multiples of 5. For example, the first class showing a value is centered at 120. That means that this class goes from 117.5 to 122.5, not including 122.5.

Example 1
Thirty-six randomly selected males between 20 and 29 years of age were weighed. The weights in pounds are shown below.

210 143 194 174 203 181 224 171 178 186 182 186
188 215 192 182 194 174 166 177 192 188 191 167
207 189 155 178 162 202 160 193 181 188 181 196

(a) Construct a histogram with class widths of 10 starting at 140.
(b) What percentage of the randomly selected males weighed less than 180 pounds?

Answers:
(a) Construct a table to organize the data in terms of the classes. The first class, from 140 to 150, includes 140 but does not include 150.

Class         Frequency
140 to 150     1
150 to 160     1
160 to 170     4
170 to 180     6
180 to 190    11
190 to 200     7
200 to 210     3
210 to 220     2
220 to 230     1

Now construct the histogram.

[Histogram of the weights, with classes of width 10 from 140 to 230]

(b) Out of the 36 participants, 12 weighed less than 180 pounds.

$$\frac{12}{36} \times 100 = 33\tfrac{1}{3}\%$$

In Example 1, we encountered a histogram with a symmetrical shape. That means that both sides of the histogram are more or less the same when the graph is folded down the middle. This symmetrical bell-shaped distribution is typical when data is collected from a population which follows a normal distribution.
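The tally in Example 1 can also be checked with a short program. The following is a minimal Python sketch, not part of the original unit; the function name tally_classes and its parameters are made up for illustration. It assumes, as in the example, that each class includes its lower boundary but not its upper boundary.

```python
# Weights (in pounds) of the 36 males from Example 1.
weights = [
    210, 143, 194, 174, 203, 181, 224, 171, 178, 186, 182, 186,
    188, 215, 192, 182, 194, 174, 166, 177, 192, 188, 191, 167,
    207, 189, 155, 178, 162, 202, 160, 193, 181, 188, 181, 196,
]


def tally_classes(data, start, width, num_classes):
    """Count the data points in each class [low, low + width).

    Hypothetical helper written for this illustration only.
    """
    counts = [0] * num_classes
    for value in data:
        index = (value - start) // width  # lower boundary is included
        if 0 <= index < num_classes:
            counts[index] += 1
    return counts


frequencies = tally_classes(weights, start=140, width=10, num_classes=9)
for i, count in enumerate(frequencies):
    low = 140 + 10 * i
    print(f"{low} to {low + 10}: {count}")

# The first four classes cover 140 up to, but not including, 180.
below_180 = sum(frequencies[:4])
percent_below_180 = below_180 / len(weights) * 100
print(f"{below_180} of {len(weights)} weighed less than 180 pounds "
      f"({percent_below_180:.1f}%)")
```

Changing start and width in the call lets the same sketch tally other data sets that use different classes.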
For this course, most of our time will be spent examining situations that follow normal distributions. However, it is important to understand that other types of distributions exist. These other types are shown below. A uniform distribution occurs when every class has equal frequency. A skewed distribution occurs when one tail is much larger than the other tail. A bimodal distribution occurs when two classes with the largest frequencies are separated by at least one class.

[Figures: Uniform Distribution, Skewed Left Distribution, Skewed Right Distribution, Bimodal Distribution]

If you wish further clarification, go to the Mathematics Multimedia Learning Objects (see page iv), access Unit 11-5 Statistics, and view MLO1 Reviewing Histograms and Frequency Polygons.

Questions

1. In each case state whether a bar graph or histogram would be used to visually represent the data.
(a) Fifty randomly selected adults reported the brand of their primary vehicle.
(b) Two hundred randomly selected bottles of a particular salsa sauce were pulled off the shelves and tested for their salt content.
(c) Seventy randomly selected Tim Hortons franchises reported their profit for the month of November.
(d) Three hundred randomly selected adults between 30 and 45 years of age were asked to report the number of children they have.
(e) One hundred and fifty randomly selected males reported their favorite sport to view on television.
(f) Sixty cups of coffee from randomly selected coffee shops had their serving temperatures recorded.
(g) A six-sided die was rolled two hundred times and the number for each roll was recorded.

2. Thirty randomly selected families of four were asked how much they spent on their last family meal at a restaurant. The following data was obtained.

70  68  62  86  78  67  94  82  75  74
66 103  65  97  64  68  80  83  67  71
77  72  69  64  90  72  78  66  64  86

(a) Construct a histogram with class widths of 5 starting at 60.
Class         Tally    Frequency
60 to 65
65 to 70
70 to 75
75 to 80
80 to 85
85 to 90
90 to 95
95 to 100
100 to 105

(b) What percentage of the families spent $90 or more on their meal?
(c) What type of distribution (normal, uniform, bimodal,…) are we dealing with?
(d) Why was a histogram, rather than a bar graph, used with this data?
(e) Are we dealing with a sample or population?

3. Every learner in the Adult Learning Program at one particular campus was asked how many hours a week they spent working on school work. The following data was collected.

36 31 25 26 34 32 31 27 26 26 28 19 23 28
28 32 28 28 24 29 32 29 30 23 41 29 28 28
35 35 23 37 35 37 31 31 30 30 30 28 28 32

(a) Are we dealing with a sample or a population?
(b) Construct a histogram with class widths of 5 starting at 15.

Class         Tally    Frequency
15 to 20

(c) What type of distribution (normal, uniform, bimodal,…) are we dealing with?

4. If you were collecting a random sample in each situation, what type of distribution (normal, uniform, bimodal,…) would you likely obtain?
(a) You randomly select 100 students at an elementary school and each must report their grade level. Each grade level occupies two classrooms in the school. What would the distribution of grade levels look like?
(b) Two groups of athletes are running the 100 m dash. One group is made up of males 12 years of age or younger, and the other is made up of males between 16 and 20 years of age. You randomly select 150 athletes and ask them to report their time for the 100 m dash. What would the distribution of times look like?
(c) Mrs. Chopra teaches one of the three grade six classes. Normally the administration tries to distribute the strongest math students evenly between the three classes. That did not occur this year and now Mrs. Chopra has a large portion of strong math students in her class. If her class was asked to complete a fair math test, what would the distribution of marks look like?
(d) You randomly select 100 females between the ages of 20 and 29 and record their heights. What would the distribution of heights look like?
(e) A college instructor had what he described as an average class of students. From his perspective there were a few weak students and a few strong students, but the majority of the students were of average ability. He gave the class an extremely challenging test where only the strongest students could maintain good marks. What would the distribution of marks for this test look like?

Describing Data, Part 1

Charlie looks at the marks his Level IV Graduate Math learners earned in a particular unit over the last year.

{82, 74, 91, 82, 79, 95, 77, 92, 86, 74, 78, 69, 84, 77, 88, 78, 71}

He wants to report how well his students performed on this particular unit without having to supply all seventeen pieces of data. The data can be described using measures of central tendency, such as the mean (arithmetic average) and median (middle).

Mean

The most common measure of central tendency is the arithmetic average, or mean. When calculating a mean, statisticians differentiate between population means and sample means by using different symbols. The procedure for calculating either of these means is identical. The population mean and sample mean are calculated by adding all the data points and then dividing by the number of data points.

$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$ where $\mu$ (mu) is the population mean

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$ where $\bar{x}$ (x bar) is the sample mean

Return to Charlie’s math marks. Since he is looking at the marks of all of the learners who completed the unit, he is dealing with a population. The population mean is calculated below.

$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
$$\mu = \frac{82 + 74 + 91 + 82 + 79 + 95 + 77 + 92 + 86 + 74 + 78 + 69 + 84 + 77 + 88 + 78 + 71}{17}$$
$$\mu = \frac{1377}{17}$$
$$\mu = 81$$

The mean mark for Charlie’s learners on this unit is 81%.

Median

The mean is not the only way to describe the center.
Another method is to use the “middle value” of the data, which is called the median. The median separates the higher half of the data from the lower half. It can be calculated in the following manner.
1. Arrange the data points in order of size, from smallest to largest.
2. If the number of data points is odd, then the median is the data point in the middle of the ordered list.
3. If the number of data points is even, then the median is the mean of the two data points that share the middle of the ordered list.

Return to Charlie’s math marks. The median is calculated below.

Order the data points from smallest to largest.
69, 71, 74, 74, 77, 77, 78, 78, 79, 82, 82, 84, 86, 88, 91, 92, 95

Since we have an odd number of data points (n = 17), the median will be the middle data point of the ordered list, which is the ninth data point.
69, 71, 74, 74, 77, 77, 78, 78, [79], 82, 82, 84, 86, 88, 91, 92, 95

The median will be 79.

Suppose we had another instructor who had sixteen learners who completed the same unit. She has recorded the marks that they earned and worked out the mean and median.

{99, 94, 80, 63, 78, 99, 67, 62, 95, 78, 66, 93, 65, 64, 98, 95}

Mean:
$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
$$\mu = \frac{99 + 94 + 80 + 63 + 78 + 99 + 67 + 62 + 95 + 78 + 66 + 93 + 65 + 64 + 98 + 95}{16}$$
$$\mu = \frac{1296}{16}$$
$$\mu = 81$$

The mean mark for these learners on this unit is 81%.

Median:
62, 63, 64, 65, 66, 67, 78, [78, 80], 93, 94, 95, 95, 98, 99, 99

$$\text{Median} = \frac{78 + 80}{2} = 79$$

Are the Mean and Median Enough?

These measures of central tendency often do not give us a complete understanding of the data set because they do not give any indication of how the data is spread out. This is especially evident when we look at the means and medians for the two groups of math students discussed above. Although the means and medians are identical for both of these groups, the marks earned by the two groups are vastly different. In Charlie’s group, the majority of students earned marks between 71 and 88.
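As a quick check on the two sets of marks above, the following is a minimal Python sketch, not part of the original unit. It uses the mean and median functions from Python's standard statistics module; the helper marks_by_decade is a made-up name for this illustration and simply counts how many marks fall in the sixties, seventies, eighties and nineties.

```python
from statistics import mean, median

# Marks for the two groups of learners discussed above.
charlie = [82, 74, 91, 82, 79, 95, 77, 92, 86, 74, 78, 69, 84, 77, 88, 78, 71]
other = [99, 94, 80, 63, 78, 99, 67, 62, 95, 78, 66, 93, 65, 64, 98, 95]


def marks_by_decade(marks):
    """Count marks in each block of ten (60s, 70s, ...).

    Hypothetical helper written for this illustration only.
    """
    counts = {}
    for mark in sorted(marks):
        decade = mark // 10 * 10
        counts[decade] = counts.get(decade, 0) + 1
    return counts


for name, marks in [("Charlie's group", charlie), ("Other group", other)]:
    print(name)
    print("  mean:  ", mean(marks))
    print("  median:", median(marks))
    print("  marks by decade:", marks_by_decade(marks))
```

Both groups give the same mean and the same median, while the counts by decade differ sharply between the two groups.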
There was only one mark in the sixties and only three marks in the nineties. The marks are clustered together. The marks for the other instructor’s learners could be largely divided into two groups; learners who earned sixties and learners who earned nineties. There were six learners who earned sixties, seven who earned nineties, and every few in between. It is important to note that our two measures of central tendency did not reveal this important difference between the two data sets. We will address this issue in the next section of this unit. When are the Mean and Median Not Close to Each Other? There are times when the mean and median may not be close to each other. One case is if an outlier exists within the data set. An outlier is a data point that falls outside the overall pattern NSSAL ©2010 12 Draft C. D. Pilmer of the data set. Consider the following data set where the data points have already been arranged in ascending order. {2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 3.4, 3.5, 3.5, 3.6, 3.7, 3.9, 4.0, 4.2, 16.7} Notice that all but one data point is between 2.8 and 4.2. The mean for this data set is 4.3 and the median is 3.5. It is obvious that in this case the median is a far better measure of central tendency than the mean. The outlier, 16.7, greatly influenced the mean to a point where it no longer accurately represented the center of the data set. The extreme sensitivity of the mean to even a single outlier and the insensitivity of the median to outliers led to the development of trimmed means. Trimmed means are calculated by ordering the data points from smallest to largest, deleting a selected number of points from both ends of the ordered list, and finally averaging the remaining numbers. For example to calculate the 5% trimmed mean, the bottom 5% of the data points and the top 5% of the data points are deleted. Consider the data set at the top of the page. We will calculate the 5% trimmed mean for this data set. If 5% of the number of data points (i.e. 
5% of 15) is 0.75, we would round up to 1 (round to nearest whole number). Since we obtained a 1, we would drop one data point from the bottom and one data point from the top of the data set.
2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 3.4, 3.5, 3.5, 3.6, 3.7, 3.9, 4.0, 4.2, 16.7
Finally we work out the mean of the remaining thirteen data points.
$$\text{5\% trimmed mean} = \frac{3.0 + 3.0 + 3.1 + 3.2 + 3.4 + 3.4 + 3.5 + 3.5 + 3.6 + 3.7 + 3.9 + 4.0 + 4.2}{13} = 3.5$$
Notice that this trimmed mean is equal to the median that we previously calculated. Once the effects of outliers are removed, the trimmed mean and the median should be close to each other.

The symbol $\bar{x}_{(T)}$ is used to represent a trimmed mean. The only problem with this symbol is that it does not indicate whether we are dealing with a 5%, 10%, 15%, or 20% trimmed mean.

Example 1
Twenty-two runners of the 100 m dash were randomly selected from colleges and universities in Canada. The time of each runner in the last competition was recorded. Of these runners, one person had pulled a hamstring and another had tripped during their last competition. The times in seconds are recorded below. Determine the mean, median, and 10% trimmed mean.
10.23 10.89 11.76 9.87 11.33 10.75 9.96 11.54 10.52 18.57 9.72
12.05 11.56 10.15 19.42 11.68 12.09 11.49 11.67 10.19 10.52 9.99

Answer:
Mean:
$$\text{Mean} = \frac{10.23 + 10.89 + 11.76 + \dots + 10.19 + 10.52 + 9.99}{22} = 11.63$$

Median:
Rearrange the data points from smallest to largest. Since we are dealing with an even number of data points (22), the median is the mean of the two data points that share the middle of the ordered list.
9.72, 9.87, 9.96, 9.99, …, 10.75, 10.89, 11.33, 11.49, …, 12.05, 12.09, 18.57, 19.42
$$\text{Median} = \frac{10.89 + 11.33}{2} = 11.11$$

10% Trimmed Mean:
If 10% of the number of data points (i.e. 10% of 22) is 2.2, we would round down to 2 (round to nearest whole number).
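The trimmed mean procedure used in this example (order the data, drop a rounded percentage of points from each end, average what remains) can be sketched in Python. This is only an illustration: the function name `trimmed_mean` is hypothetical, and rounding halves upward is an assumption, since the unit only says to round to the nearest whole number.

```python
def trimmed_mean(data, percent):
    """Mean after dropping `percent`% of the points from each end of the ordered data."""
    ordered = sorted(data)
    # Assumption: a fractional count of exactly .5 rounds up; the unit does not cover that case.
    drop = int(len(ordered) * percent / 100 + 0.5)
    kept = ordered[drop:len(ordered) - drop]
    return sum(kept) / len(kept)

# Runner times (seconds) from Example 1.
times = [10.23, 10.89, 11.76, 9.87, 11.33, 10.75, 9.96, 11.54, 10.52, 18.57, 9.72,
         12.05, 11.56, 10.15, 19.42, 11.68, 12.09, 11.49, 11.67, 10.19, 10.52, 9.99]

# 10% of 22 is 2.2, so two points are dropped from each end before averaging.
print(round(trimmed_mean(times, 10), 2))
```

Sorting first means the points that are dropped are always the smallest and largest ones, so the two slow times (18.57 and 19.42) no longer pull the result upward.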
We will now drop two data points from the bottom and two data points from the top of the data set, and then work out the mean of the remaining eighteen data points.
9.72, 9.87, 9.96, 9.99, 10.15, …, 11.76, 12.05, 12.09, 18.57, 19.42
$$\text{10\% trimmed mean} = \frac{9.96 + 9.99 + 10.15 + \dots + 11.76 + 12.05 + 12.09}{18} = 11.02$$

Questions
Please use the appropriate symbols ($\bar{x}$, $\mu$, and $\bar{x}_{(T)}$) when answering these questions.

1. A study regarding the size of winter wolf packs in regions of the United States, Canada, and Finland was conducted. The following data from 18 randomly selected packs was obtained.
2 3 15 8 7 8 2 4 13 7 3 7 10 7 5 4 2 4
(a) Are we dealing with a sample or a population? _____________________
(b) Determine the mean and the median.
(c) Why would the researchers likely not use a trimmed mean with this data set?

2. A local cab company has a fleet of nine cars. The company kept records of the amount of money each vehicle required for a one-week period. The data is shown below.
$125 $157 $210 $139 $182 $167 $143 $150 $162
(a) Are we dealing with a sample or a population? _____________________
(b) Are we dealing with a numerical or categorical data set? _____________________
(c) Determine the mean and median.

3. A magazine conducted a survey to understand the average class size of first year courses at a local community college. They randomly selected 17 first year classes and obtained the following numbers.
23 37 36 40 39 115 28 25 23 32 27 16 15 31 27 34 41
(a) Are we dealing with a sample or a population? ____________________
(b) Determine the mean, median, 5% trimmed mean, and 10% trimmed mean.
(c) Why is it appropriate to use trimmed means in this situation?
(d) If this data set were comprised of 78 data points and we wanted to calculate a 5% trimmed mean, how many data points would be dropped from the bottom and top of the data set?

4.
A new subdivision outside of Halifax was constructed over the last few years. Barb wanted to know what the average value of the new homes was. She was not prepared to look at the assessed values of all 218 new homes. Instead she randomly selected 24 homes and recorded their assessed values. These values in thousands of dollars are shown below.
266 265 226 254 231 221 246 252 253 241 261 589
243 270 267 253 287 320 221 264 257 249 226 267
(a) Calculate the mean, median, and 5% trimmed mean.
(b) Which of these measures is not influenced or less influenced by extremely high or low data points?
(c) Would a histogram or a bar graph be used with this data set?

5. (a) In gymnastics and diving, several judges score each athlete. The final score for the athlete is calculated by removing the high and low scores and averaging the remainder. Why do you think they use this trimmed mean scoring method in gymnastics and diving?
(b) Judging in figure skating has always been controversial, but this issue really came to the surface during the 2002 Salt Lake City Winter Olympics. Two Canadian skaters, Jamie Sale and David Pelletier, were awarded the silver medal rather than the gold medal that was expected by the crowd and many television commentators, and that was supported by the scores of four of the nine judges. It was later learned that the French judge had conspired with the Russian judge to favor the Russian skating pair. At the time they were using an ordinal method for awarding medals, rather than the trimmed mean method used in gymnastics and diving. Explain why the trimmed mean method would also have been ineffective at dealing with this incident of collusion during the 2002 Winter Olympics.

Describing Data, Part 2
Measures of central tendency (mean, median, and mode) do not give us any indication of how the data is spread out. Consider the following two sets of data.
First Data Set: 13, 14, 15, 15, 15, 16, 17
Second Data Set: 10, 12, 13, 15, 17, 18, 20

The mean for both of these data sets is 15; however, the individual pieces of data in these sets are considerably different. In the first set, the numbers range from 13 to 17, and clearly cluster around the number 15. In the second set the numbers range from 10 to 20 and tend to be more spread out around the mean. The dispersion is far greater in the second set than in the first.

Standard deviation is one way of measuring the spread or dispersion of a set of data relative to the mean. If the standard deviation is low, then the data cluster around the mean. If the standard deviation is high, then the data are spread out around the mean. Without getting into the actual calculations, the standard deviation for the first data set is 1.20 and the standard deviation for the second data set is 3.30. The larger number indicates greater dispersion.

Calculating Standard Deviation
Before we get to the calculations, we have to remind you of an important point and introduce two formulas. In the first section we talked about populations and samples. A population is the set representing all measurements of interest to an investigator, while a sample is simply a subset of the measurements from the population chosen at random. We previously learned that both the population mean and sample mean are calculated by adding all the data points and then dividing by the number of data points. The only difference is that we use different symbols to differentiate a population mean from a sample mean.
$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
where $\mu$ (mu) is the population mean
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
where $\bar{x}$ (x bar) is the sample mean

Similarly we have two different formulas for population standard deviation and sample standard deviation. They do, however, differ in more than just the symbols used.

The formula for population standard deviation, $\sigma$ (sigma), is shown below.
You are not expected to memorize this formula.
$$\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \dots + (x_n - \mu)^2}{n}}$$
This formula requires that you complete six steps.
Step 1: Find the mean, $\mu$.
Step 2: Calculate the difference between each data point and the mean, $x_i - \mu$.
Step 3: Square the differences found in Step 2, $(x_i - \mu)^2$.
Step 4: Add the squared differences, $(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \dots + (x_n - \mu)^2$.
Step 5: Divide the sum from Step 4 by the number of data points.
Step 6: Take the square root of the value from Step 5.
The easiest way to learn how to use this formula (i.e. complete the six steps) is to construct a table where only small portions of the calculation are completed at any one time.

Example 1
Mrs. Gillis teaches math to adults. At the end of the year she examines the final marks for all of her students who have completed the course. She wants to work out the standard deviation of those marks.
87 72 91 82 74 93 75 83 78 75 81

Answer:
Find the mean.
$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
$$\mu = \frac{87 + 72 + 91 + 82 + 74 + 93 + 75 + 83 + 78 + 75 + 81}{11}$$
$$\mu = 81$$
Construct the table.
$x_i$     $x_i - \mu$         $(x_i - \mu)^2$
87      87 − 81 = 6       6² = 36
72      72 − 81 = −9      (−9)² = 81
91      91 − 81 = 10      (10)² = 100
82      1                 1
74      −7                49
93      12                144
75      −6                36
83      2                 4
78      −3                9
75      −6                36
81      0                 0
                          Sum = 496
Note: Remember that we stated that the standard deviation is one way of measuring the spread or dispersion of a set of data relative to the mean. Notice that in the second column of this table we are finding the differences between individual data points and the mean. These differences, not surprisingly, are integral in calculating the standard deviation.
$$\sigma = \sqrt{\frac{496}{11}}$$
$$\sigma = 6.71$$
The population standard deviation is 6.71.

The formula for sample standard deviation, $S_x$ (S subscript x), is shown below. You are not expected to memorize this formula.
$$S_x = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}}$$
This formula requires that you complete a six step procedure very similar, but not identical, to the procedure for population standard deviation. The difference is in Step 5, where the sum of the squared differences is divided by $n - 1$ rather than by $n$.

Example 2
Mr. MacDonald is the dean of the adult education program at the college. At the end of the year he wants to understand the types of marks learners are obtaining in their new math program. Instead of looking at every mark earned in this course, he randomly selects the final marks of 10 students. He wants to work out the standard deviation of those marks.
75 80 70 88 91 77 82 85 73 79

Answer:
Find the mean.
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
$$\bar{x} = \frac{75 + 80 + 70 + 88 + 91 + 77 + 82 + 85 + 73 + 79}{10}$$
$$\bar{x} = 80$$
Construct the table.
$x_i$     $x_i - \bar{x}$     $(x_i - \bar{x})^2$
75      −5        25
80      0         0
70      −10       100
88      8         64
91      11        121
77      −3        9
82      2         4
85      5         25
73      −7        49
79      −1        1
                  Sum = 398
$$S_x = \sqrt{\frac{398}{10 - 1}}$$
$$S_x = 6.65$$
The sample standard deviation is 6.65.

If you wish further clarification go to the Mathematics Multimedia Learning Objects (see page iv), access Unit 11-5 Statistics, and view MLO6 Standard Deviation.

Questions
1. Determine the sample standard deviation for the following data.
25 32 24 28 31 28
$\bar{x}$ = ________
$x_i$     $x_i - \bar{x}$     $(x_i - \bar{x})^2$

2. Determine the population standard deviation for the following data.
3.7 4.3 5.0 4.6 4.0 4.7 3.9 4.2
$\mu$ = ________
$x_i$     $x_i - \mu$     $(x_i - \mu)^2$

3. Two data sets have been provided.
First data set: 15 14 13 18 16 13 16 15 15
Second data set: 17 15 16 14 11 19 16 11 16
(a) Calculate the sample mean and sample standard deviation for each data set.
First data set: $\bar{x}$ = ________
$x_i$     $x_i - \bar{x}$     $(x_i - \bar{x})^2$
Second data set: $\bar{x}$ = ________
$x_i$     $x_i - \bar{x}$     $(x_i - \bar{x})^2$
(b) The standard deviations are different for the two data sets. What is this telling you?

4. In the grocery store, Anne noticed that a particular brand of canned beans was labeled 540 grams. She randomly selected 8 cans and checked the weight of their contents. She ended up with the following data.
542 539 544 549 537 541 548 552
(a) What is the median for these data?
(b) What is the mean for these data?
(c) Determine the standard deviation.
[Table for calculations; first column: $x_i$]
(d) If Anne selected another random sample of size 8, would we expect to obtain the same mean and standard deviation? Explain.

5. Barb, a math instructor, recorded the heights in centimetres of all of the male students in her Level IV math courses. She obtained the following measurements.
181 173 184 183 190 180 186 176 185
(a) What is the median for these data?
(b) What is the mean for these data?
(c) Is Barb dealing with a categorical or numerical data set?
(d) Determine the standard deviation.
[Table for calculations; first column: $x_i$]
(e) Another instructor at a different campus also has 9 male learners in his Level IV Math courses. He measured their heights. He found the mean to be 182 cm with a standard deviation of 6.4 cm. Based on these results, what can you say about the heights of this instructor’s male learners compared to Barb’s male learners?
(f) A third instructor at another campus also has 9 male learners in her Level IV Math courses. She measured their heights. She found the mean to be 179 cm with a standard deviation of 4.8 cm. Based on these results, what can you say about the heights of this instructor’s male learners compared to Barb’s male learners?

6. Create two data sets that meet all of the following conditions.
• They have at least eight pieces of data.
• They must have a mean of 10.
• They have standard deviations that are quite different.

7. Without attempting any calculations, match each standard deviation with the appropriate histogram. Explain how you arrived at your answers. Please note that all of the histograms are drawn at the same scale.
Standard Deviations: (a) 0.69 (b) 1.40 (c) 3.34 (d) 3.62
[Figure: histograms (i), (ii), (iii), and (iv)]
(i) Matches with _____
(ii) Matches with _____
(iii) Matches with _____
(iv) Matches with _____
Explanation:

Note: We have not fully explained the usefulness of standard deviation in this section of the chapter.
As we progress through the unit, we will constantly revisit this topic and broaden our understanding of its usefulness, particularly in the context of normal distribution.

Using Technology
In the last section we learned how to work out the population standard deviation ($\sigma$) and sample standard deviation ($S_x$) using paper and pencil. The TI graphing calculators can calculate both of these and more for us. Using such technology is particularly useful when our sample size is large.

Example
Of the 1643 people who were departing the airport for overseas destinations on the morning of January 13, an airport worker randomly selected 30 people and asked them how long, in minutes, it took to check in and pass through security. She obtained the following data.
40 46 68 51 42 55 48 52 38 49 56 50 35 54 50
60 56 44 53 58 60 45 52 55 46 51 40 50 64 45
(a) Draw a histogram using technology. Use class widths of 5 starting at 35.
(b) Determine the mean time.
(c) Determine the standard deviation.
(d) Determine the median.

Answers:
(a) Step 1: Enter the Data
STAT > Edit > Enter the data in L1
If data already exists in L1, then move the cursor up so L1 is highlighted, press CLEAR, and move the cursor back down.
Step 2: Draw the Histogram
STATPLOT > Select Plot 1 > Turn on the plot, select histogram, Xlist should be L1 and Freq should be 1 > WINDOW > Set Xmin at 35, Xmax at 70, Xscl at 5, Ymin at 0, Ymax at 10, Yscl at 1 > GRAPH > TRACE > Use the right and left arrows
(b) Parts (b) and (c) will be done simultaneously. Please note that the data has already been entered in the calculator when we constructed the histogram.
STAT > CALC > 1-Var Stats > Enter the List > ENTER
The sample mean ($\bar{x}$) is 50.4.
(c) The sample standard deviation ($S_x$) is 7.64. (See above.)
(d) While the 1-Var Stats results are still on the screen, scroll using the down arrow until you find Med. The median in this case is 50.5.
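For learners without a TI calculator, the same summary values can be checked with Python's standard `statistics` module. This is a minimal sketch, not part of the unit's calculator procedure; the variable name `times` is only illustrative. Note that `stdev` divides by n − 1 (sample standard deviation), while `pstdev` divides by n (population standard deviation).

```python
import statistics

# Check-in times (minutes) for the 30 randomly selected travellers in the example.
times = [40, 46, 68, 51, 42, 55, 48, 52, 38, 49, 56, 50, 35, 54, 50,
         60, 56, 44, 53, 58, 60, 45, 52, 55, 46, 51, 40, 50, 64, 45]

print("mean:", round(statistics.mean(times), 1))           # sample mean
print("sample SD:", round(statistics.stdev(times), 2))     # divides by n - 1
print("population SD:", round(statistics.pstdev(times), 2))  # divides by n
print("median:", statistics.median(times))
```

Because these 30 travellers are a sample of the 1643 departing passengers, the sample standard deviation is the appropriate value to report here.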
Note: The calculator uses the symbol $\sigma_x$, rather than $\sigma$, to represent the population standard deviation. The calculator does not report the population mean ($\mu$); however, as we previously learned, the formulas for the sample mean and population mean are the same. We can therefore use the value the calculator generates for $\bar{x}$ as the value for $\mu$.

Questions
1. The Survey of Study Habits and Attitudes (SSHA) is a psychological test given to college students to evaluate their motivation, study habits, and attitudes towards their post-secondary studies. A local community college campus randomly selected 20 female first year students to complete the SSHA. The individual results are listed below.
154 167 129 151 153 164 140 112 157 144
162 166 158 143 174 190 180 137 175 155
(a) Are we dealing with a population or a sample?
(b) Using technology draw a histogram showing the distribution of SSHA scores. Use class widths of 10 starting at 110.
(c) Determine the mean, median, and standard deviation.
(d) Describe the distribution (normal, uniform, skewed, or bimodal).

2. Below you will find a list of Prime Ministers of Canada since Confederation in 1867. We have also been supplied with their age upon first taking office as PM.
Prime Minister (PM)             First Term Starts    Age
John A. MacDonald               1867                 52
Alexander Mackenzie             1873                 51
John Abbott                     1891                 70
John Sparrow Thompson           1892                 48
Mackenzie Bowell                1894                 70
Charles Tupper                  1896                 74
Wilfrid Laurier                 1896                 54
Robert Borden                   1911                 57
Arthur Meighen                  1920                 46
William Lyon Mackenzie King     1921                 47
Richard Bennett                 1930                 60
Louis St-Laurent                1948                 66
John Diefenbaker                1957                 61
Lester Pearson                  1963                 65
Pierre Trudeau                  1968                 48
Joe Clark                       1979                 39
John Turner                     1984                 55
Brian Mulroney                  1984                 45
Kim Campbell                    1993                 46
Jean Chretien                   1993                 59
Paul Martin                     2003                 65
Stephen Harper                  2006                 46
(a) Are we dealing with a population or a sample?
(b) Using technology draw a histogram showing the distribution of ages for PMs first taking office.
Use class widths of 5 starting at 35.
(c) Determine the mean PM age for first taking office.
(d) Determine the standard deviation.
(e) Determine the median.
(f) What can you conclude based on the histogram and standard deviation?

3. Provincial governments keep records of the number of young offenders who are incarcerated each year. The incarceration rates vary greatly from province to province. In 2006 Nova Scotia reported an incarceration rate of 9.91. That means that 9.91 out of every 10 000 young persons were incarcerated. Below you will find the incarceration rates for the provinces and territories for 2006. (Source: Statistics Canada)
Province    Rate
YT          8.57
NT          46.12
NU          20.49
BC          4.45
AB          7.18
SK          24.54
MB          21.25
ON          7.51
QC          3.89
NB          10.20
PE          7.21
NS          9.91
NL          11.93
(a) Are we dealing with a population or a sample?
(b) Using technology draw a histogram showing the distribution of incarceration rates. Use class widths of 5 starting at 0.
(c) Determine the mean, median, and standard deviation.
(d) There is a substantial difference between the mean and median. Why is this so?

Normal Distribution
A frequency polygon is the shape that is formed when midpoints of the tops of the bars on a histogram are joined by straight lines. In this case, the frequency polygon forms a bell-shaped curve that is associated with a population that follows a normal distribution. Many variables observed in nature, including heights, weights, and reaction times, follow normal distributions. Consider the heights of female students at college. There are a few women who are less than 5 feet tall, a few who are taller than 6 feet, but the majority of the women are probably between 5’3” and 5’8”. We would expect a normal distribution for the heights of women attending college.

Let’s consider a population that results in a normal distribution.
The normal curve will be centered about the population mean ($\mu$). The standard deviation ($\sigma$) determines the extent to which the curve spreads out. If we look at the two normal distributions supplied below, we can see that both distributions are centered around the same value, 65. That means that the mean for both of these populations is 65. The standard deviations, although not supplied, are not the same. The standard deviation for normal distribution A must be lower than that for distribution B because curve A is narrower, meaning that its data points are more clustered around the mean. Please note that the horizontal axis is labeled x. This indicates that we are looking at the distribution of the individual data points denoted by the symbol x.
[Figure: normal distributions A and B, both centered at 65, with horizontal axis x]

According to the 68-95-99.7 rule, in any bell-shaped distribution, the following holds true.
• Approximately 68% of the data points will lie within one standard deviation of the mean.
• Approximately 95% of the data points will lie within two standard deviations of the mean.
• Approximately 99.7% of the data points will lie within three standard deviations of the mean.

This rule is true for populations and large samples. However, it is written using different symbols for each.

For Populations:
• Approximately 68% of the data points are between $\mu - \sigma$ and $\mu + \sigma$.
• Approximately 95% of the data points are between $\mu - 2\sigma$ and $\mu + 2\sigma$.
• Approximately 99.7% of the data points are between $\mu - 3\sigma$ and $\mu + 3\sigma$.

For Large Samples:
• Approximately 68% of the data points are between $\bar{x} - S_x$ and $\bar{x} + S_x$.
• Approximately 95% of the data points are between $\bar{x} - 2S_x$ and $\bar{x} + 2S_x$.
• Approximately 99.7% of the data points are between $\bar{x} - 3S_x$ and $\bar{x} + 3S_x$.

Let’s see how this rule applies to a population with a normal distribution where the population mean ($\mu$) is 40 and the population standard deviation ($\sigma$) is 10. This distribution is shown below. Notice that it is centered about the mean.
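As a rough check of the rule for a population like this one (mean 40, standard deviation 10), the following Python sketch draws a large number of values from a normal distribution and counts how many fall within one, two, and three standard deviations of the mean. The helper name `fraction_within`, the seed, and the choice of 100 000 values are all illustrative assumptions, not part of the unit.

```python
import random

def fraction_within(values, mean, sd, k):
    """Fraction of values lying within k standard deviations of the mean."""
    low, high = mean - k * sd, mean + k * sd
    return sum(low <= v <= high for v in values) / len(values)

random.seed(2010)  # arbitrary seed so the run can be repeated
mu, sigma = 40, 10
population = [random.gauss(mu, sigma) for _ in range(100_000)]

for k in (1, 2, 3):
    share = fraction_within(population, mu, sigma, k)
    print(f"between {mu - k * sigma} and {mu + k * sigma}: {share:.1%}")
```

With this many simulated values the three percentages should land close to 68%, 95%, and 99.7%, although they will not match exactly because the values are random.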
For this population we would expect that approximately 68% of the data points would be between 30 ($\mu - \sigma$ or 40 − 10) and 50 ($\mu + \sigma$ or 40 + 10). We would expect that approximately 95% of the data points would be between 20 ($\mu - 2\sigma$) and 60 ($\mu + 2\sigma$). Finally we would expect approximately 99.7% of the data points to be between 10 ($\mu - 3\sigma$) and 70 ($\mu + 3\sigma$).

Checking the 68-95-99.7 Rule Using a Simulation
Most learners do not follow rules blindly; they like to know where the rule comes from and/or see if the rule really works. The mathematics required to derive the 68-95-99.7 rule is beyond this course. However, we can conduct a simulation on the graphing calculator to demonstrate that the rule does indeed work. We will accomplish this using the random number generator built into the calculator. Before doing so, we will have to seed the calculator to ensure that the numbers generated on your calculator differ from those generated by your classmate’s calculator. Since it is unlikely that you and your classmates share the same telephone number, you will use your phone number to seed the calculator.
Type in phone number > STO→ > MATH > PRB > rand > ENTER
For the simulation that follows we will be using the rand command that is found under the MATH menu.

Suppose you wished to randomly select 100 bee hives of the same size from the same region of the province. You wished to record the seasonal honey production (in kilograms) for each hive over a four-year period. We are obviously not going to actually collect this data; instead we will use the graphing calculator to simulate the collection of this data.

Step 1: We will simulate the collection of honey production numbers for the 100 hives for the first year. Once completed, you will have values ranging from 40 to 70 in List 1.
STAT > EDIT > Move the cursor up to highlight L1 > Enter 40+30*rand(100) > ENTER

Step 2: Now we will simulate the collection of honey production numbers for the 100 hives for years two, three, and four. This will be accomplished by repeating Step 1 but using List 2, List 3, and List 4. Once completed, you will have values ranging from 40 to 70 in all four lists.

Step 3: Since we want the total honey production over the four-year period for each hive, we will need to add the numbers in the same row. This can be accomplished using List 5 where one states that its values are generated by adding the corresponding values in Lists 1, 2, 3, and 4.
STAT > EDIT > Move the cursor up to highlight L5 > Enter L1+L2+L3+L4 > ENTER

Step 4: Order the numbers in List 5 from smallest to largest (i.e. in ascending order).
QUIT > CLEAR > STAT > SortA( > L5 > ENTER

Step 5: Enter the one hundred data points from List 5 in the chart below. Round the values to the nearest tenth.

Step 6: Construct a histogram using the following classes.
Class         Frequency
150 to 160
160 to 170
170 to 180
180 to 190
190 to 200
200 to 210
210 to 220
220 to 230
230 to 240
240 to 250
250 to 260
260 to 270
270 to 280

Step 7: Use the calculator to determine the sample mean and sample standard deviation for the data in List 5.
STAT > CALC > 1-Var Stats > L5 > ENTER
Record the two values.
Sample Mean = ________ Sample Standard Deviation = ________

Questions:
1. In terms of this situation, what does the first data point in List 5 represent?
2. In terms of this situation, what does the last data point in List 5 represent?
3. In terms of this situation, what does $\bar{x}$ represent?
4. (a) Calculate $\bar{x} - S_x$ and $\bar{x} + S_x$ using the values we obtained in Step 7.
(b) Go through the chart from Step 5 and count the number of data points that are between $\bar{x} - S_x$ and $\bar{x} + S_x$.
(c) According to the 68-95-99.7 rule, approximately what percentage of the data points should be between $\bar{x} - S_x$ and $\bar{x} + S_x$? Is this supported by this simulation? Explain.

5. (a) Calculate $\bar{x} - 2S_x$ and $\bar{x} + 2S_x$ using the values we obtained in Step 7.
(b) Go through the chart from Step 5 and count the number of data points that are between $\bar{x} - 2S_x$ and $\bar{x} + 2S_x$.
(c) According to the 68-95-99.7 rule, approximately what percentage of the data points should be between $\bar{x} - 2S_x$ and $\bar{x} + 2S_x$? Is this supported by this simulation? Explain.

Note: For Questions 4 and 5, you may feel that the data points from the simulation do not resoundingly support the 68-95-99.7 rule. Remember that the sample we took was only of size 100 (i.e. n = 100). Better results could be obtained if we increased the sample size significantly (e.g. n = 1000), but unfortunately it would take us a lot more time to complete the simulation and accompanying questions.

Using the 68-95-99.7 Rule
In the last section we learned how the 68-95-99.7 rule applies to normal populations or large samples that result in a distribution that is approximately normal. In this section, we will show how this rule can be used to answer a number of questions. Consider the following statements for a normal population.
• If 68% of the data points are found between $\mu - \sigma$ and $\mu + \sigma$, then 34% of the data points would be between $\mu$ and $\mu + \sigma$.
• If 68% of the data points are found between $\mu - \sigma$ and $\mu + \sigma$, then 34% of the data points would be between $\mu - \sigma$ and $\mu$.
[Figure: normal curve with 68% between $\mu - \sigma$ and $\mu + \sigma$, split into 34% on each side of $\mu$; horizontal axis x]

If we extend this line of thinking, we can state the following.
• If 95% of the data points are found between $\mu - 2\sigma$ and $\mu + 2\sigma$, then 47.5% of the data points would be between $\mu$ and $\mu + 2\sigma$.
• If 95% of the data points are found between $\mu - 2\sigma$ and $\mu + 2\sigma$, then 47.5% of the data points would be between $\mu - 2\sigma$ and $\mu$.
• If 99.7% of the data points are found between $\mu - 3\sigma$ and $\mu + 3\sigma$, then 49.85% of the data points would be between $\mu$ and $\mu + 3\sigma$.
• If 99.7% of the data points are found between $\mu - 3\sigma$ and $\mu + 3\sigma$, then 49.85% of the data points would be between $\mu - 3\sigma$ and $\mu$.

Naturally this line of thinking can also be applied to samples that result in a distribution that is approximately normal; however, we will use the symbols $\bar{x}$ and $S_x$.

Example 1
For a normal population with a mean of 15 and standard deviation of 2, what percentage of the data points would measure
(a) between 15 and 19?
(b) between 13 and 21?
(c) between 11 and 13?

Answers:
(a) This question could be restated. It would read, “What percentage of the data points would be between $\mu$ and $\mu + 2\sigma$?”
[Figure: normal curve with the region from 15 ($\mu$) to 19 ($\mu + 2\sigma$) labeled 47.5%]
Therefore approximately 47.5% of the data points will be between 15 and 19.
(b) This question could be restated. It would read, “What percentage of the data points would be between $\mu - \sigma$ and $\mu + 3\sigma$?”
[Figure: normal curve with the region from 13 ($\mu - \sigma$) to 15 ($\mu$) labeled 34% and the region from 15 ($\mu$) to 21 ($\mu + 3\sigma$) labeled 49.85%]
Therefore approximately 83.85% (34% + 49.85%) of the data points will be between 13 and 21.
(c) This question could be restated. It would read, “What percentage of the data points would be between $\mu - 2\sigma$ and $\mu - \sigma$?”
[Figure: normal curve with the region from 11 ($\mu - 2\sigma$) to 15 ($\mu$) labeled 47.5% and the region from 13 ($\mu - \sigma$) to 15 ($\mu$) labeled 34%]
Therefore approximately 13.5% (47.5% − 34%) of the data points will be between 11 and 13.

This is a difficult concept to explain without a lot of diagrams. It is strongly recommended that you seek further clarification by going to the Mathematics Multimedia Learning Objects (see page iv), accessing Unit 11-5 Statistics, and viewing MLO8 Using Normal Distribution.

Questions
1. Use the 68-95-99.7 rule on a distribution of data points with a population mean of 230 and a population standard deviation of 15 to answer the following questions.
(a) What percentage of the data points would measure between 215 and 245?
(b) What percentage of the data points would measure between 230 and 260?
(c) What percentage of the data points would measure between 215 and 230?
(d) What percentage of the data points would measure between 185 and 230?
(e) What percentage of the data points would measure between 200 and 245?
(f) What percentage of the data points would measure between 215 and 275?
(g) What percentage of the data points would measure between 185 and 260?
(h) What percentage of the data points would measure between 245 and 260?
(i) What percentage of the data points would measure between 185 and 200?
(j) What percentage of the data points would measure between 245 and 275?

2. A sample of 2000 randomly selected bagels of the same type was removed from a production line. The mean weight was 104 grams with a standard deviation of 3 grams. Assume the distribution of bagel weights is bell-shaped.
(a) Approximately how many bagels were within 9 grams of the mean?
(b) Approximately how many bagels were within 3 grams of the mean?
(c) Approximately how many bagels are between 98 grams and 104 grams?
(d) Approximately how many bagels are between 101 grams and 110 grams?
(e) Approximately how many bagels are between 107 grams and 110 grams?
(f) Approximately how many bagels are between 98 grams and 110 grams?
(g) Approximately how many bagels are between 95 grams and 101 grams?
(h) Approximately how many bagels are between 98 grams and 113 grams?
(i) Approximately how many bagels are between 95 grams and 104 grams?
(j) Approximately how many bagels are between 110 grams and 113 grams?

Making Inferences
Up to this point in this resource, we have looked at a variety of ways of describing data, whether that data was derived from a sample or a population. This means that we have focused on descriptive statistics.
At this point in the unit, we will now focus on making inferences about a population based on a sample. In other words, we will use data obtained from a sample to try to understand the population. This is inferential statistics. There are specific steps for making such inferences.
1. Specify the question(s) to be investigated and identify the population of interest.
2. Determine how the sample will be selected from the population.
3. Obtain the sample and analyze the sample information.
4. Use the sample information to make inferences about the population.
5. Use a measure of reliability to indicate how much confidence can be placed in that inference.
The next two sections of this unit will look at the different ways to collect a sample. After that, the remaining sections will focus on steps 3, 4, and 5. Specifically we will look at a concept that is new for most of you: confidence intervals based on a sample mean.

Collecting a Sample
Since collecting data from an entire population is often not feasible, we may use a sample from that population in order to answer questions about the population as a whole. It is important that we collect unbiased samples to ensure that the samples are representative of the population.

Investigation
Suppose we want to know the mean (average) square footage of buildings in a local industrial park. Rather than looking at all forty-eight buildings, we only want to collect a sample of size 6 (i.e. look at only 6 buildings). The diagram below shows an aerial view of the park where each quadrilateral (four-sided figure) represents a building and each square represents 1000 square feet. For example, building #2 is represented by 4 squares; therefore, the area of that building is 4000 square feet.
[Figure: aerial view of the industrial park showing buildings 1 to 48 on a grid of squares]

We will collect five samples of size six from this population. In the first case, you will look at the population and then select the six buildings that you believe best represent the population. This will be called our non-random sample. Record the six building numbers and the corresponding areas (in square feet). After doing this, determine the mean building area and the standard deviation for our sample of size 6.

Non-Random Sample
Building Number: ____ ____ ____ ____ ____ ____
Area: ____ ____ ____ ____ ____ ____
Mean ($\bar{x}$) = _________ Sample Standard Deviation ($S_x$) = ________

To collect the other four samples of size six, we will use the random number generator on the graphing calculator. The calculator will be instructed to generate six random integers between 1 and 48, where each number generated is a building number. This can be accomplished using the following commands.
MATH > PRB > randInt( > Enter the lower and upper limit, and the sample size (all separated by commas). For example, randInt(1,48,6) > ENTER.
Record the six building numbers and the corresponding areas (in square feet). After doing this, determine the mean building area and the standard deviation for the sample of size 6. Repeat this procedure three more times to generate random samples #2, #3, and #4.

Random Sample #1
Building Number: ____ ____ ____ ____ ____ ____
Area: ____ ____ ____ ____ ____ ____
Mean ($\bar{x}$) = _________ Sample Standard Deviation ($S_x$) = ________

Random Sample #2
Building Number: ____ ____ ____ ____ ____ ____
Area: ____ ____ ____ ____ ____ ____
Mean ($\bar{x}$) = _________ Sample Standard Deviation ($S_x$) = ________

Random Sample #3
Building Number: ____ ____ ____ ____ ____ ____
Area: ____ ____ ____ ____ ____ ____
Mean ($\bar{x}$) = _________ Sample Standard Deviation ($S_x$) = ________

Random Sample #4
Building Number: ____ ____ ____ ____ ____ ____
Area: ____ ____ ____ ____ ____ ____
Mean ($\bar{x}$) = _________ Sample Standard Deviation ($S_x$) = ________

Conclusions:
The population mean ($\mu$) and population standard deviation ($\sigma$) for the areas of these forty-eight buildings have already been worked out.
$\mu = 3146$ $\sigma = 2993$
Look at the five sample means and five sample standard deviations you previously worked out.
(a) How do these sample means and standard deviations compare to the population mean and standard deviation?
(b) How do the results from your non-random sample compare to those from your four random samples? What can you conclude?

In this activity, you collected five samples (one non-random and four random), each with a sample size of 6 (i.e. n = 6). The sample size refers to the number of data points in the sample.

Questions:
1. Explain why each of the following would likely produce a biased sample. For some there is more than one reason.
(a) David will conduct a survey regarding violence in the media. He will randomly select people who are attending an ultimate fighter competition at the local arena and ask them to complete the survey.
(b) Genevieve wants to know how much money the average woman spends monthly on clothing. She will conduct the survey at Mic Mac Mall. She approaches people who she feels will likely answer her survey questions. If they agree to participate in the survey, then she asks them the questions.
(c) A television talent show asks viewers to phone in their vote(s) for their favorite contestant. The telephone lines are only open for four hours and viewers can vote as many as six times.
(d) Kendrick wants to know how members of his community feel about the new gun registry law. He leaves survey sheets on a counter at the local hardware store. There is also a sign asking interested individuals to complete the survey.
(e) Robert wants to know if Canadians still support the military action in Afghanistan. He conducts a phone survey where he asks 200 randomly selected adults the following question. “Considering the number of deaths and injuries of Canadian soldiers, and persistent allegations of prisoner abuse by local Afghan authorities, should Canadian soldiers remain in Afghanistan?”

Sampling Methods

Preferred Sampling Methods
The sampling methods listed below are considered preferred sampling methods because they have a greater chance of being unbiased. All four of these are forms of random sampling.

1. Simple Random Sample
A simple random sample is a sample selected in such a manner that every sample of size n has the same chance of being selected. For example, suppose we put twenty different names into a hat, stirred the contents, and without looking retrieved the following four names.
Barb, Elliot, Brian, Dave
Suppose those names were returned to the hat, the contents stirred, and now the following four names were drawn.
Floyd, Joan, Manish, Krys
Suppose that process was repeated and the following four names were obtained.
Suzette, Charlie, Jeff, Elliot
We collected three samples of size four. All three of these samples, along with all other combinations of four names, have the same probability of being drawn. The Barb, Elliot, Brian, Dave combination has no greater chance of being selected than the Floyd, Joan, Manish, Krys combination or any other four-name combination. For this reason it is a simple random sample.

2. Cluster Sample
A cluster sample is used when the available sampling units are groups of elements, or clusters. One or more clusters are randomly selected and then every element in those clusters is included in the sample. For example, suppose Tim Hortons wanted to know how much on average each person spent in their Toronto establishments between 7 am and 9 am. Conducting a survey in their hundreds of establishments scattered across the city would be costly. Instead, they could randomly select four of their establishments and record how much each patron spent at those four establishments between 7 am and 9 am. They randomly selected four clusters and included every element (patron) in the survey. For these reasons, this is considered a cluster sample.

3. Stratified Random Sample
With a stratified random sample, one takes a simple random sample from each of a given number of subpopulations, or strata. For example, suppose the federal party in power wanted to know how Canadians felt about gun registration. If they surveyed a simple random sample of size 1000, they would not be certain that every province or territory was fairly represented in the survey. It is possible that one province is over-represented and another is under-represented. To alleviate this problem, they could use a stratified random sample where each province and territory (stratum) is proportionally represented. So if one province accounted for 20% of the eligible voters in Canada, then the survey would ensure that 200 of the 1000 randomly selected eligible voters came from that province. If another province only accounted for 7% of the eligible voters in Canada, then the survey would ensure that only 70 of the 1000 randomly selected eligible voters came from that province.

4. Systematic Sample
A systematic sample is chosen according to a formula or rule. For example, suppose you wanted to use the names listed in a phone book to conduct a telephone survey within a small community. You might decide to contact every 200th person in the book, but you need to know where you should start on the list. You could use a random number generator to select a number between 1 and 200. Suppose it produced the number 67. That means the first five people selected would be the 67th, 267th, 467th, 667th, and 867th in the telephone book. This is an example of a systematic sample because the rule required that we interview every 200th person. To increase the likelihood that the sample would be unbiased, a random number was used to select the starting point on the list.

Poor Sampling Methods
The two sampling methods listed below are not random sampling methods and are generally biased.
The results obtained from these samples are generally not representative of the population.

1. Voluntary Response Sample
Participants are not selected; rather, through their own actions they choose to participate in the survey. The most common forms are call-in polls and online voting. The most familiar example of this poor sampling method occurs on the television show American Idol. Audience members are encouraged to phone or text their vote in for their favorite performer. Some audience members will choose not to participate while others may vote multiple times.

2. Convenience Sample
A convenience sample is chosen based on convenient availability. For example, suppose an individual wanted to know how drivers feel about recent changes to vehicle inspections proposed by the provincial government. The individual decides to conduct his survey at a local shopping mall close to his residence. He approaches individuals who he feels might be willing to participate in his survey.

If you wish further clarification, go to the Mathematics Multimedia Learning Objects (see page iv), access Unit 11-5 Statistics, and view MLO5 Identifying Types of Samples.

Questions
1. Identify the sampling method used. Both preferred and poor sampling methods are found below.
(a) A community organization wanted to use a sample to infer how much parents of elementary school children were spending this September on each child’s school supplies in their school district. Each child has a five digit school identification number. The organization placed the numbers 0 through 99 in a hat. They drew the number 28. Based on this they asked any parent whose child’s ID number ended with the digits 2 and 8 to participate in the survey.
Method: _______________________________________
(b) Asra wants to know how her fellow employees feel about the company’s new medical plan. She leaves the survey sheets on a table in the company cafeteria.
There is also a sign that asks interested individuals to complete the survey.
Method: _______________________________________
(c) Jack received twelve baskets of apples from a farmer to sell at the local market. He wanted to use a sample to infer the average weight of the apples he was selling. He numbered the baskets 1 through 12, rolled a twelve-sided die, obtained a 9, weighed every apple in basket 9, and worked out the average weight of those apples.
Method: _______________________________________
(d) Montez wants to know if people feel that cable companies should have to pay local television stations when they rebroadcast their signals. Since he owns the local gas station, he decided to conduct the survey there. He asks every customer to complete the survey.
Method: _______________________________________
(e) Jorell’s company is giving away a one week all-inclusive vacation package to one of its employees. The thirty employees each fill out a ballot, the ballots are placed in a bucket, the contents are stirred, and a ballot is drawn in order to determine the winner.
Method: _______________________________________
(f) A new reality show, So You Think You Can Yodel, asks its television audience to vote online for their favorite performer.
Method: _______________________________________
(g) Kimi is in charge of the corporate headquarters for a large company. There are 1000 female employees and 500 male employees on the premises. She has decided to build an employee wellness centre stocked with gym equipment on the premises. In order to ensure that she buys the appropriate equipment for the employees, she surveys a sample of size 120, asking respondents about their gym equipment preferences. She randomly selects 80 females of the 1000, and 40 males of the 500 to complete the survey.
Method: _______________________________________
(h) The Metro Center is hosting an Ultimate Fighting Challenge competition.
Following the event, the promoters wanted to know how the audience felt about the competition. The Center is divided into 43 sections. They randomly selected four numbers between 1 and 43, and asked all the individuals in those four sections to complete a questionnaire.
Method: _______________________________________
(i) Ranelda is planning a trip to the Dominican Republic and wants to know how travelers felt about the all-inclusive resort she is considering. She decides to go to TripAdvisor.com where she can read reviews posted by other travelers regarding this resort.
Method: _______________________________________
(j) The owner of a large car dealership wants to know if her customers were satisfied with the purchase of their vehicles. When each customer’s final paperwork comes across the owner’s desk, she rolls a six-sided die. If the number 1 or 2 is rolled, then the customer is contacted two weeks later to complete a brief telephone survey.
Method: _______________________________________
(k) A Federal politician wants to know how his constituents in 12 different districts feel about the new tax increases. A random sample is selected in such a manner that there is proportional representation from each of the districts.
Method: _______________________________________

2. In question 1, which surveys would likely produce biased results?

3. Suppose you worked for the Nova Scotia Department of Education and you were in charge of determining how well grade 12 students did on the last provincial math exam. When these exams are distributed to the schools they are numbered, starting at 80000 and going to 82500. You could ask all 44 public and private schools to send in the 2500 corrected exams, review them all, and report the results. This would be a very time-consuming and costly endeavor. Instead you decide to collect a sample of size 500, review those exams, and report the results.
Please note that not all high schools in our province are of the same size. Larger schools have graduating classes exceeding 300, while smaller schools may have as few as 10 graduates. Explain how all four preferred sampling methods could be used to collect this sample.
(a) Simple Random Sample
(b) Cluster Sample
(c) Stratified Random Sample
(d) Systematic Sample

Important Note:
In the remaining sections of this unit, we will only be concerned with data collected from simple random samples. If we were to consider the other preferred sampling techniques, then we would have to learn how to use a wider range of statistical tools. That is beyond the scope of this course.

Simulated Sampling

In the next few sections, we will examine how a sample is used to make inferences about a population. Ultimately this will lead us to confidence intervals, but this will be a gradual process. To understand the relationship between a sample and a population, we will have to start with a known population. This is a population where we know the population mean and population standard deviation. This probably seems a little backwards. Why would we want to collect a sample if we already understand the population itself? We will be taking this approach so that we can ultimately see how the concept of a confidence interval was developed.

Our Known Population
In the next three sections of this unit, we will work with the same known population when we are conducting simulations or providing explanations. Suppose the federal government had tested the air quality of every household residence (houses, apartments, mobile homes, etc.) in Canada. They specifically looked at the concentration of a specific airborne contaminant. This concentration was measured in micrograms per cubic metre (µg/m³). Suppose they had found that the results followed a normal distribution with a population mean of 412 µg/m³ and a standard deviation of 38 µg/m³.
Investigation

We will simulate the collection of a large sample (sample size 40) from our known population. We will then determine the mean for that sample. We will then simulate the collection of three more samples of the same size from the same population and work out their corresponding means. All of this will be done using a graphing calculator.

Sample 1
We will use the randNorm( command on the graphing calculator to simulate the collection of a large sample from our known population. This command generates and displays one or more random numbers from a specific normal distribution. For this reason we must also enter the population mean ($\mu$), the standard deviation ($\sigma$), and the sample size ($n$).

MATH > PRB > randNorm( > Enter 412, 38, and 40, all separated by commas. Close the brackets. > STO → > L1 > ENTER

Go to List 1 and record the first and last five data points in the table below.

In terms of this situation, what does the first data point in the table represent?

Rather than using the 1-Var Stats command to determine the sample mean, we will use the mean( command embedded in the LIST commands. The reason for this alternate approach will become more apparent as we work through this section and the next.

LIST > MATH > mean( > L1 > ENTER

Record the sample mean ($\bar{x}$).
Sample Mean = ___________

Sample 2
Follow the same procedure to simulate the collection of another sample of the same size. Record the first and last five data points from L1 in the table below. Also determine the sample mean.
Sample Mean = ___________

Do the first and last five data points for Sample 2 match the first and last five data points for Sample 1? Why is this?

Sample 3
In this case, we will not use the randNorm( and mean( commands separately. We will combine them so that we can obtain our sample mean in one step. Use the command shown below.

mean(randNorm(412, 38, 40))

Record the sample mean ($\bar{x}$).
Sample Mean = ___________

Sample 4
Repeat the procedure used for Sample 3 to simulate the collection of our fourth sample. Record the sample mean.
Sample Mean = ___________

Questions
1. In terms of this situation, what does the mean for Sample 1 represent?
2. Are the four sample means equal to each other? Why is this?
3. Are the sample means close to the expected value? Explain.
4. Given a specific situation and population, which one of these four statements is correct? Explain how you arrived at your answer, making reference to our simulations on the last two pages.
(a) The population mean is random and the sample mean is fixed.
(b) The population mean is random and the sample mean is random.
(c) The population mean is fixed and the sample mean is fixed.
(d) The population mean is fixed and the sample mean is random.
Explanation:
5. In our simulation, we collected four samples and determined four sample means for our known population. Suppose we were dealing with an unknown population. If this were the case, would we be able to determine which of the sample means was closest to the population mean? YES or NO

Sampling Distribution of the Sample Means

Up to this point when we have looked at distributions, primarily normal distributions, we were looking at how individual data points, $x$, were spread relative to the mean. For this reason the horizontal axis on these distributions was labeled with the symbol $x$.

[Figure: normal curve with the horizontal axis labeled $x$.]

In this section, we will not be looking at the spread of individual data points for a known population. Instead, we will look at the distribution of sample means ($\bar{x}$) for a known population. That means we could repeatedly take samples of the same size from our known population, work out the sample means, and look at the distribution of sample means. This type of distribution is called the sampling distribution of the sample means.
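For readers who would like to experiment without a graphing calculator, the idea of repeatedly sampling from the known population can be sketched in Python. This is only an illustrative sketch and not part of the calculator procedure: it assumes the household air-quality population described earlier (normal, mean 412 µg/m³, standard deviation 38 µg/m³), and the function name `sample_mean` and the seed value are hypothetical choices made for this example.

```python
import random
import statistics

MU = 412     # population mean in µg/m³ (from the known population)
SIGMA = 38   # population standard deviation in µg/m³

def sample_mean(n, rng):
    """Simulate one random sample of size n and return its mean.

    Plays the same role as mean(randNorm(412, 38, n)) on the calculator.
    """
    sample = [rng.gauss(MU, SIGMA) for _ in range(n)]
    return statistics.fmean(sample)

# A fixed seed (an arbitrary choice) makes the run repeatable.
rng = random.Random(2010)

# Four samples of size 40 give four different sample means.
means = [sample_mean(40, rng) for _ in range(4)]
for i, m in enumerate(means, start=1):
    print(f"Sample {i}: sample mean = {m:.1f}")
```

Each run with a different seed gives different sample means, while the population mean stays fixed at 412. Collecting many such means and looking at how they are spread is what the rest of this section does.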
As we learned in the last section, sample means are random, but is there a pattern to all those sample means that we can exploit? Yes. One such pattern is that these distributions are normal, as shown below. Notice that the horizontal axis is labeled $\bar{x}$, indicating that we are looking at the distribution of sample means, not individual data points. There are two other important properties that will be discovered in the next investigation.

[Figure: normal curve with the horizontal axis labeled $\bar{x}$.]

Investigation

A true sampling distribution of the sample means is the distribution of all possible values of the sample means that result when random samples of the same size are drawn from the same population. This means that we are taking all possible samples of size n from this population, working out the sample means, and looking at the distribution of those means. The mathematics required to create a true sampling distribution is beyond this course; however, we can use a graphing calculator to simulate the collection of data to generate a rough approximation of the sampling distribution of the sample means.

We will continue to work with the scenario involving the airborne contaminant in Canadian households. Remember that for this known population we have a mean of 412 and a standard deviation of 38. We will simulate the collection of 100 samples of size 40 and calculate the corresponding 100 sample means. A frequency distribution of these 100 sample means will serve as our rough approximation of the sampling distribution of the sample means.

Step 1
In the last section we used the command mean(randNorm(412,38,40)) to simulate the collection of one sample of size 40 and then work out the sample mean. In this investigation, we want to do this 100 times so that we end up with 100 sample means. This will be accomplished by adding the seq( command. This command is found under LIST and then accessing OPS.

seq(mean(randNorm(412,38,40)),A,1,100) → L1

Press ENTER to activate the command.
You should see a small scrolling line in the upper right-hand corner of the screen. This indicates that the calculator is busy working on your task. It will take the calculator 4 to 5 minutes to complete the command.
(i) Why do you think it takes so long for the calculator to complete the command?

Step 2
Rearrange the data in List 1 from smallest to largest (i.e. ascending order). Use the SortA( command found by pressing the STAT button. Record the first and last five values in the newly sorted List 1.
(ii) In terms of this situation, what does the first value in your table represent?

Step 3
Use the data in List 1 to construct a histogram. Use the classes shown in the table below.

Class          Frequency
394 to 397     ________
397 to 400     ________
400 to 403     ________
403 to 406     ________
406 to 409     ________
409 to 412     ________
412 to 415     ________
415 to 418     ________
418 to 421     ________
421 to 424     ________
424 to 427     ________
427 to 430     ________
430 to 433     ________

(iii) How would we describe this distribution (uniform, skewed, bimodal, normal)?

Step 4
Use the 1-Var Stats command to determine the mean and the standard deviation for the data in List 1. Please note that the calculator does not know that the 100 values in List 1 are sample means; therefore, it reports the mean as $\bar{x}$ and the standard deviation as $S_x$. Although the calculator will produce the correct numbers, it does not report them using the correct symbols. We will learn what the correct symbols should be in the next section of this unit.
Mean of the 100 Sample Means = __________
Standard Deviation of the 100 Sample Means = __________

Step 5
We will use the calculator to generate another 100 sample means. The only difference is that we will use samples of size 60, rather than size 40. Enter the following command into the calculator and give it 4 to 5 minutes to complete the task.

seq(mean(randNorm(412,38,60)),A,1,100) → L1

In this case we are still looking at data that would produce a rough approximation of the sampling distribution of the sample means.
We will not bother sorting the data or drawing a histogram. We will, however, use the 1-Var Stats command to determine our mean and standard deviation.
Mean of the 100 Sample Means = __________
Standard Deviation of the 100 Sample Means = __________

Questions
1. (a) What is the population mean in this situation? _________
(b) When we collected 100 random samples of size 40 from our known population, what did we obtain for the mean of the 100 sample means? _________
(c) When we collected 100 random samples of size 60 from our known population, what did we obtain for the mean of the 100 sample means? _________
(d) Is there a relationship between the population mean and the two means of the sample means? Explain. Why do you think this is?

2. (a) What is the population standard deviation in this situation? _________
(b) When we collected 100 random samples of size 40 from our known population, what did we obtain for the standard deviation of the 100 sample means? _________
(c) When we collected 100 random samples of size 60 from our known population, what did we obtain for the standard deviation of the 100 sample means? _________
(d) Based on our answers for (a), (b) and (c), we can see that the population standard deviation is not equal to either of the standard deviations of the sample means. There is, however, a relationship between these standard deviations. Take the population standard deviation and divide it by the square root of the sample size. This will have to be done twice since we were working with two different sample sizes (n = 40 and n = 60).

$$\frac{\text{population standard deviation}}{\sqrt{\text{sample size}}}$$

For the 100 samples of size 40:
$\dfrac{\text{population standard deviation}}{\sqrt{\text{sample size}}}$ = __________ = __________

For the 100 samples of size 60:
$\dfrac{\text{population standard deviation}}{\sqrt{\text{sample size}}}$ = __________ = __________

What are these two values we just calculated approximately equal to?
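As a check on the calculator work, the two simulations in this investigation (100 samples of size 40, then 100 samples of size 60) can also be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure the unit itself uses: it draws from the same known population (mean 412, standard deviation 38), and the helper name `simulate_sample_means`, the `results` dictionary, and the seed are hypothetical choices made for illustration.

```python
import math
import random
import statistics

MU, SIGMA = 412, 38   # known population mean and standard deviation

def simulate_sample_means(n, repeats, rng):
    """Return a list of `repeats` sample means, each from a sample of size n."""
    return [
        statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(repeats)
    ]

rng = random.Random(2011)   # arbitrary seed so the run is repeatable
results = {}

for n in (40, 60):
    means = simulate_sample_means(n, 100, rng)
    mean_of_means = statistics.fmean(means)
    sd_of_means = statistics.stdev(means)      # sample standard deviation, like S_x
    predicted = SIGMA / math.sqrt(n)           # population SD / sqrt(sample size)
    results[n] = (mean_of_means, sd_of_means, predicted)
    print(f"n = {n}: mean of means = {mean_of_means:.1f}, "
          f"SD of means = {sd_of_means:.2f}, sigma/sqrt(n) = {predicted:.2f}")
```

With only 100 simulated samples the match is rough rather than exact, but the mean of the sample means should land near 412, and the standard deviation of the sample means should land near 38 divided by the square root of the sample size (about 6.01 for n = 40 and about 4.91 for n = 60).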
Through this investigation, we have discovered three important properties of the sampling distribution of the sample means. These three properties will be discussed in the next section of the unit, titled the Central Limit Theorem.

Central Limit Theorem

In the last section, we examined the sampling distribution of the sample means. This type of distribution is created by repeatedly taking samples of the same size from a known population, working out the sample means, and looking at the distribution of sample means. Although we were unable to examine a true sampling distribution of the sample means, we were able to use a graphing calculator to generate a rough approximation of the sampling distribution of the sample means. Through a simulation we discovered three important properties of the sampling distribution.

Three Properties
1. The sampling distribution of the sample means is approximately normal (i.e. bell-shaped).
2. The mean of the sample means is equal to the population mean.
Mean of the Sample Means = Population Mean
3. The standard deviation of the sample means is equal to the population standard deviation divided by the square root of the sample size.
$$\text{Standard Deviation of the Sample Means} = \frac{\text{Population Standard Deviation}}{\sqrt{\text{Sample Size}}}$$

All of this can be restated using the appropriate notation. It is referred to as the Central Limit Theorem.

The Central Limit Theorem states the following.
• If random samples of size n are repeatedly drawn from any population with a finite mean and standard deviation, then the resulting sampling distribution of the sample means ($\bar{x}$) is approximately normal when n is large (i.e. $n \geq 30$).
• The mean of the sample means is equal to the population mean.
$$\mu_{\bar{x}} = \mu$$
($\mu_{\bar{x}}$ is pronounced “mu subscript x bar”)
• The standard deviation of the sample means is equal to the population standard deviation divided by the square root of the sample size.
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
($\sigma_{\bar{x}}$ is pronounced “sigma subscript x bar”)

[Figure: normal curve of the sample means, centred at $\mu_{\bar{x}} = \mu$, with a distance of $\sigma_{\bar{x}}$ marked on either side of the centre; horizontal axis labeled $\bar{x}$.]

Applying the 68-95-99.7 rule to the sampling distribution of the sample means, we can say that:
• 68% of the sample means are between $\mu_{\bar{x}} - \sigma_{\bar{x}}$ and $\mu_{\bar{x}} + \sigma_{\bar{x}}$.
Or 68% of the sample means are between $\mu - \dfrac{\sigma}{\sqrt{n}}$ and $\mu + \dfrac{\sigma}{\sqrt{n}}$.
• 95% of the sample means are between $\mu_{\bar{x}} - 2\sigma_{\bar{x}}$ and $\mu_{\bar{x}} + 2\sigma_{\bar{x}}$.
Or 95% of the sample means are between $\mu - 2\dfrac{\sigma}{\sqrt{n}}$ and $\mu + 2\dfrac{\sigma}{\sqrt{n}}$.
• 99.7% of the sample means are between $\mu_{\bar{x}} - 3\sigma_{\bar{x}}$ and $\mu_{\bar{x}} + 3\sigma_{\bar{x}}$.
Or 99.7% of the sample means are between $\mu - 3\dfrac{\sigma}{\sqrt{n}}$ and $\mu + 3\dfrac{\sigma}{\sqrt{n}}$.

Example 1
Random samples of size 50 are repeatedly drawn from a known population whose population mean is 78 and population standard deviation is 12. This information is used to construct a sampling distribution of the sample means.
(a) Describe the shape of the resulting distribution.
(b) Where is the sampling distribution of the sample means centred?
(c) What is the standard deviation of the sample means?
(d) Between what two values would one expect 68% of the sample means to fall?
(e) Between what two values would one expect 95% of the sample means to fall?

Answers:
We are not dealing with a single sample because the question stated that we are repeatedly collecting samples of the same size. As stated, we are dealing with the sampling distribution of the sample means. For this reason, this question expects that we understand the Central Limit Theorem.
(a) According to the Central Limit Theorem, the sampling distribution of the sample means will be bell-shaped.
(b) The sampling distribution of the sample means will be centred about the population mean of 78.
(c) The standard deviation of the sample means must be calculated.
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{50}} = 1.70$$
(d) We know that for a sampling distribution of the sample means, 68% of the sample means are between $\mu - \dfrac{\sigma}{\sqrt{n}}$ and $\mu + \dfrac{\sigma}{\sqrt{n}}$.
$$\mu - \frac{\sigma}{\sqrt{n}} = 78 - 1.70 = 76.3 \qquad \mu + \frac{\sigma}{\sqrt{n}} = 78 + 1.70 = 79.7$$
For this particular sampling distribution, 68% of the sample means are between 76.3 and 79.7.
(e) We know that for a sampling distribution of the sample means, 95% of the sample means are between $\mu - 2\dfrac{\sigma}{\sqrt{n}}$ and $\mu + 2\dfrac{\sigma}{\sqrt{n}}$.
$$\mu - 2\frac{\sigma}{\sqrt{n}} = 78 - 2(1.70) = 74.6 \qquad \mu + 2\frac{\sigma}{\sqrt{n}} = 78 + 2(1.70) = 81.4$$
For this particular sampling distribution, 95% of the sample means are between 74.6 and 81.4.

Example 2
A random sample of size 40 is taken from a known population where $\mu = 24.3$ and $\sigma = 4.1$. The data points collected are shown in the chart below.

18.78  15.33  27.53  26.45  24.49  27.99  25.83  20.44  21.08  22.08
25.43  21.41  22.36  15.50  15.21  26.02  20.70  20.45  18.54  20.84
16.91  22.47  29.13  26.68  28.26  24.20  19.89  26.98  22.96  19.31
20.01  23.53  26.49  34.21  26.85  24.15  25.66  23.26  20.61  27.19

(a) What is the population mean?
(b) What is the population standard deviation?
(c) What is the sample mean? Is it close to the expected value? Explain.
(d) What is the sample standard deviation?
(e) If we collected 800 random samples of size 40 from this known population, could the distribution of the 800 sample means serve as a rough approximation of the sampling distribution of the sample means? Explain.
(f) If we collected 800 random samples of size 40 from this known population, what would be the approximate value of the mean of the sample means?
(g) If we collected 800 random samples of size 40 from this known population, what would be the approximate value of the standard deviation of the sample means?
(h) Between what two values would one expect 544 of the 800 sample means to fall?

Answers:
The first four parts of this question have nothing to do with the Central Limit Theorem since we are not repeatedly collecting random samples of the same size from a known population.
(a) The population mean ($\mu$) is supplied in the question. It is 24.3.
(b) The population standard deviation ($\sigma$) is supplied in the question. It is 4.1.
(c) We were given a chart for a sample of size 40. We will enter the 40 data points into List 1 on the graphing calculator and use the 1-Var Stats command to determine the sample mean ($\bar{x}$) for this question and the sample standard deviation ($S_x$) for the next question. The sample mean is 23.13. One would expect the sample mean (23.13) to be close to the population mean (24.3). This is the case.
(d) The sample standard deviation is 4.16.
(e) A true sampling distribution of the sample means involves taking all possible samples of the same size from the population. In this question we have limited ourselves to 800 samples of the same size (n = 40), but the resulting distribution of sample means will serve as a rough approximation of the sampling distribution of the sample means. In the previous section we completed an investigation where we simulated the collection of 100 samples from a known population to generate a rough approximation of the sampling distribution of the sample means. We are doing the same thing in this question, except we are dealing with 800 random samples of the same size instead of 100.
(f) Since we are dealing with a rough approximation of the sampling distribution of the sample means, we can use the Central Limit Theorem. We learned that the mean of the sample means is equal to the population mean.
$$\mu_{\bar{x}} = \mu = 24.3$$
(g) We can use the Central Limit Theorem. The standard deviation of the sample means is equal to the population standard deviation divided by the square root of the sample size.
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4.1}{\sqrt{40}} = 0.65$$
(h) The number 544 is 68% of 800. That means that the question could be restated as “Between what two values would one expect 68% of the sample means to fall?” We know that for a sampling distribution of the sample means, 68% of the sample means are between $\mu - \dfrac{\sigma}{\sqrt{n}}$ and $\mu + \dfrac{\sigma}{\sqrt{n}}$.
$$\mu - \frac{\sigma}{\sqrt{n}} = 24.3 - 0.65 = 23.65 \qquad \mu + \frac{\sigma}{\sqrt{n}} = 24.3 + 0.65 = 24.95$$
For this particular rough approximation of the sampling distribution of the sample means, 544 of the 800 sample means will be between 23.65 and 24.95.

Example 3
The mean height of men in a specific age group is 71 inches with a standard deviation of 2.3 inches. Let $\bar{x}$ be the sample mean height for a random sample of 30 men in this age group. What are the mean and standard deviation of the distribution of all possible $\bar{x}$’s?

Answer:
In this contextual problem, the relevant material is presented in a much more subtle manner. The population mean ($\mu = 71$) and the population standard deviation ($\sigma = 2.3$) have been supplied. We also know that we are dealing with the Central Limit Theorem because the question is asking for the mean of the $\bar{x}$’s (i.e. mean of the sample means) and the standard deviation of the $\bar{x}$’s (i.e. standard deviation of the sample means).

Mean of the Sample Means
$$\mu_{\bar{x}} = \mu = 71$$

Standard Deviation of the Sample Means
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2.3}{\sqrt{30}} = 0.42$$

Questions
1. Match each term with the appropriate symbol listed below.

Term                                            Symbol
(a) Sample Mean                                 ______
(b) Standard Deviation of the Sample Means      ______
(c) Sample Standard Deviation                   ______
(d) Population Mean                             ______
(e) Sample Size                                 ______
(f) Mean of the Sample Means                    ______
(g) Population Standard Deviation               ______

Symbols: $\mu$, $S_x$, $\mu_{\bar{x}}$, $\sigma$, $\bar{x}$, $\sigma_{\bar{x}}$, $n$

2. A random sample of size 45 is to be selected from a population with mean $\mu = 329$ and standard deviation $\sigma = 27$.
(a) If samples of the same size are repeatedly collected from this population, what would the mean of the sample means be equal to?
(b) If samples of the same size are repeatedly collected from this population, what would the standard deviation of the sample means be equal to?

3. Random samples of size 60 are repeatedly selected from a known population with a mean of 87 and a standard deviation of 7.2. These repeatedly collected samples allow a sampling distribution of the $\bar{x}$’s to be drawn.
(a) What type of distribution (uniform, bimodal, normal, or skewed) would result?
(b) Determine the mean of the sample means and indicate where it would be located on the distribution of the $\bar{x}$'s.
(c) Determine the standard deviation of the sample means.
(d) What percentage of the sample means would be within one $\sigma_{\bar{x}}$ of the population mean?
(e) Between what two values would one expect 95% of the sample means to fall?

4. Researchers examined the speeds traveled by motorists on a specific section of a highway in the month of August. The researchers found that the population mean was 106.2 km/h with a population standard deviation of 4.1 km/h. We collect a random sample of 55 motorist speeds from this known population. We then repeatedly collect samples of the same size so that a sampling distribution of mean motorist speeds can be constructed. Where is the resulting distribution centred, and how much is it spread out about its centre?

5. The mean weight of baggage checked in by an individual adult passenger boarding a domestic flight is 28.5 kg with a standard deviation of 5.0 kg. A sample of size 30 is taken from this known population. The data points are shown below in the chart.
29.1 32.9 37.6 28.3 26.2 26.8 29.5 30.3 28.4 24.7
29.1 33.0 32.4 31.7 33.2 25.4 23.7 28.3 22.4 31.7
20.4 25.1 28.2 26.7 23.3 28.9 22.8 18.1 21.9 37.8
(a) What is the population mean?
(b) What is the population standard deviation?
(c) Determine the sample mean.
(d) What does the sample mean represent in this situation and is it close to the expected value?
(e) Determine the sample standard deviation.
(f) If samples of the same size are repeatedly collected from this known population, what would be the value of the standard deviation of the sample means?
(g) If samples of the same size are repeatedly collected from this known population, what would be the value of the mean of the sample means?
(h) For the sampling distribution of $\bar{x}$'s, between what two values would one expect 68% of the sample means to fall?
(i) For the sampling distribution of $\bar{x}$'s, between what two values would one expect 95% of the sample means to fall?

6. Explain in your own words what the difference is between the sample standard deviation and the standard deviation of the sample means.

7. Suppose we were to sample from a known population. In each of three cases, determine which phrase would best describe the resulting distribution.
Answer: ____ (a) 500 random samples of size 40 are collected from a known population and 500 sample means are generated using these samples.
Answer: ____ (b) A random sample of size 40 is collected from a known population allowing a distribution of $x$'s to be drawn.
Answer: ____ (c) Random samples of size 40 are repeatedly collected from a known population allowing a distribution of $\bar{x}$'s to be drawn.
Choices:
(i) a distribution of data points
(ii) a sampling distribution of sample means
(iii) a rough approximation of the sampling distribution of the sample means

8. The Valley Apple Growing Association knows that the mean weight of a particular type of apple grown in their county for sale in supermarkets is 86 grams with a standard deviation of 3.7 grams. Let $\bar{x}$ be the mean weight for a random sample of 52 apples.
(a) What are the mean and standard deviation of the distribution of all possible $\bar{x}$'s?
(b) Between what two values would one expect 99.7% of the $\bar{x}$'s to fall?

9. Random samples of the same size are repeatedly collected from a known population with a mean of 98.6 and a standard deviation of 10.8. Determine the mean and standard deviation of the sampling distribution of all possible $\bar{x}$'s for each of the following sample sizes.
(a) n = 40
(b) n = 60
(c) n = 80

10. Look at the previous question. How does the standard deviation of the sample means change as the sample size increases? Is this what we would expect?
Explain.

11. Three sampling distributions of sample means have been created using the same known population; however, three different sample sizes were used. One used repeatedly collected random samples of size 30. The other two used repeatedly collected random samples of size 60 and 90.
[Figure: three sampling distributions of the sample means, labelled (i), (ii), and (iii), plotted on a common $\bar{x}$ axis]
(a) What is the population mean for this known population?
(b) Match the three sampling distributions (i, ii, iii) to the appropriate sample sizes (30, 60, 90). Briefly explain how you arrived at these answers.

12. Four different sampling distributions have been plotted on the same axes. Two of the sampling distributions come from the same population; however, the sizes of the repeatedly collected samples differ. The same is true of the other two sampling distributions of the sample means. Match each description with the appropriate distribution.
[Figure: four sampling distributions of the sample means, labelled (i), (ii), (iii), and (iv), plotted on a common $\bar{x}$ axis]
Answer: ____ (a) Population mean is 70 and repeatedly collected samples are of size 80.
Answer: ____ (b) Population mean is 40 and repeatedly collected samples are of size 80.
Answer: ____ (c) Population mean is 70 and repeatedly collected samples are of size 40.
Answer: ____ (d) Population mean is 40 and repeatedly collected samples are of size 40.

Point Estimates and Interval Estimators

In the last three sections, we have learned the following.
• The population mean is fixed.
• The sample mean is random.
• If random samples of the same size are repeatedly collected from a known population, the resulting sampling distribution of the sample means displays three distinct properties defined by the Central Limit Theorem.
So what does this have to do with inferential statistics? In other words, how do we use this information to help us make inferences about a population based on a sample?
In the section titled Simulated Sampling, we simulated the collection of four samples of size 40 from a known population.
In this case, the population mean was 412 µg/m³ of contaminant in the air with a population standard deviation of 38 µg/m³. Here are the results another adult learner obtained when she completed the activity.
Sample 1: $\bar{x}$ = 412.3
Sample 2: $\bar{x}$ = 411.9
Sample 3: $\bar{x}$ = 413.2
Sample 4: $\bar{x}$ = 413.0
Notice that all of these sample means are fairly close to the population mean (412 µg/m³). We can use a sample mean obtained from one sample to represent a plausible value for the population mean. A single sample mean is called a point estimate because this single value is used as a plausible value for the population mean.
In inferential statistics, we prefer to report an interval of reasonable values based on a sample, rather than a single plausible value (point estimate or sample mean). This interval of reasonable values is called an interval estimator. The interval estimator of the population mean is called the confidence interval. Associated with every confidence interval is a confidence level. The confidence level indicates the level of assurance we have that the resulting confidence interval encloses the population mean.

Example 1
Taylor works as a quality control officer at a compact fluorescent light bulb factory. She wants to understand how long, on average, one of these light bulbs lasts. She randomly selects 40 new bulbs from the assembly line and tests them to see how long each will last. Rather than reporting the mean lifespan of the 40 bulbs (i.e. sample mean/point estimate), she decides to report the following confidence interval (i.e. interval estimator). She reports that the population mean lifespan of this type of bulb is between 5880 hours and 6130 hours with 95% confidence. What does this last sentence mean?

Answer:
Confidence intervals are constructed in a specific manner that we will learn about later in this section. In this case, Taylor’s confidence interval runs from 5880 hours to 6130 hours and has a confidence level of 95%.
The sentence means that the method that produced this interval from 5880 to 6130 has a 0.95 probability of enclosing the true mean lifespan (i.e. population mean) of these light bulbs. Therefore there is a 0.05 probability that this method does not create an interval that encloses the true mean lifespan (i.e. population mean).
It does not mean that there is a 0.95 probability that the population mean falls within the interval from 5880 to 6130. You are probably asking yourself how this sentence differs from the one in the previous paragraph. It has to do with the fact that the population mean is fixed and the sample mean (which a confidence interval is derived from) is random. The incorrect meaning states that the “population mean falls within the interval.” This statement implies that the population mean is random, rather than fixed. For this reason, the explanation is wrong.
One way to visualize the correct meaning of a confidence interval is to think about a parachutist trying to hit a target on the ground. The target, which is fixed to the ground, is our population mean. The parachutist with the big parachute is the confidence interval. We would like some portion of the parachute to hit the target, but there is a possibility that the parachute might miss the target altogether. In the diagram below, the confidence interval (width of the parachute) will enclose the population mean (the target).
[Figure: a parachute descending over a target; the width of the parachute represents the confidence interval and the target represents the population mean]
In the diagram below we have three parachutes (three confidence intervals). Two of these parachutes will enclose the target (population mean), but one will not.
[Figure: three parachutes above a single target; two cover the target and one misses it]
When we are dealing with a 95% confidence level, we are saying that 95 out of 100 confidence intervals should enclose the population mean.
Thinking about our parachuting analogy, one would say that 95 out of 100 of the parachutes would enclose the target and 5 of the 100 would not.
The parachuting analogy makes sense when talking about a known population where we know the population mean (i.e. the location of the target). In the real world, we use the confidence interval as a set of plausible values for the population mean that may enclose that unknown mean (i.e. we do not know the location of the target).

Example 2
The Department of Health randomly selected 200 males between the ages of 25 and 30. They recorded the resting heart rate of these individuals. Rather than reporting the mean resting heart rate, they reported the following. The population mean resting heart rate for males (ages 25 years to 30 years) is between 79 beats per minute and 83 beats per minute with 90% confidence. Explain what is meant by the last sentence.

Answer:
The Department of Health is reporting a confidence interval that goes from 79 to 83. This particular confidence interval was calculated with a 90% confidence level. They are stating that the method used to construct their confidence interval has a 0.9 probability of enclosing the population mean resting heart rate. There is a small probability (0.1) that this method creates an interval that does not enclose the population mean.

It is great that we know what a confidence interval is and understand what it means, but how do we calculate it and what does it have to do with the Central Limit Theorem?

Developing the Confidence Interval
The development of the confidence interval is tied directly to our understanding of the sampling distribution of the sample means and hence the Central Limit Theorem. For the sampling distribution of the sample means, we learned that approximately 95% of the sample means will be within 2 standard deviations of the population mean (more precisely, within 1.96 standard deviations of the population mean).
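The 95% claim above can be checked with a short simulation. The sketch below is only an illustration: it borrows the population values from the Simulated Sampling activity (µ = 412, σ = 38, n = 40), assumes the population is normally distributed (the unit does not say this), and uses 2000 repeated samples, which is an arbitrary choice. It counts what fraction of the sample means land within 1.96 σ/√n of the population mean.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

# Population values from the Simulated Sampling activity.
# Assumption: the population is normally distributed (not stated in the unit).
MU, SIGMA, N = 412, 38, 40
NUM_SAMPLES = 2000  # assumed number of repeated samples (arbitrary)

# Standard deviation of the sample means: sigma / sqrt(n)
std_of_sample_means = SIGMA / math.sqrt(N)

within = 0
for _ in range(NUM_SAMPLES):
    # Draw one random sample of size N and work out its sample mean.
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = sum(sample) / N
    # Is this sample mean within 1.96 sigma/sqrt(n) of the population mean?
    if abs(x_bar - MU) <= 1.96 * std_of_sample_means:
        within += 1

fraction_within = within / NUM_SAMPLES
print(f"Fraction of sample means within 1.96 sigma/sqrt(n): {fraction_within:.3f}")
```

The printed fraction should land close to 0.95, although it will vary a little from run to run if the seed is changed.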
The standard deviation of the sample means is defined as follows.
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Visually, 95% of the sample means are within the region shown on the diagram below.
[Figure: sampling distribution of $\bar{x}$ centred at µ, with the region extending $1.96\frac{\sigma}{\sqrt{n}}$ to the left and right of µ marked]
Get ready for it; here’s the big conceptual jump that you will have to think about. If a single sample mean ($\bar{x}$) is within $1.96\frac{\sigma}{\sqrt{n}}$ of the population mean, then the interval from $\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$ to $\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$ will enclose the population mean.
This can be seen in the diagram below. Three sample means within $1.96\frac{\sigma}{\sqrt{n}}$ of the population mean have been drawn below the sampling distribution. We then went $1.96\frac{\sigma}{\sqrt{n}}$ to the left and right of each of these sample means to create our desired interval. Notice that all three of these intervals enclose the population mean (i.e. cross the vertical line in the centre representing the population mean, µ).
[Figure: sampling distribution of $\bar{x}$ centred at µ; below it, three intervals from $\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$ to $\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$, each crossing the vertical line at µ]
If, however, that sample mean is not within $1.96\frac{\sigma}{\sqrt{n}}$ of the population mean, then the resulting interval constructed using that sample mean will not enclose the population mean. Such is the case in the diagram below.
[Figure: sampling distribution of $\bar{x}$ centred at µ; below it, one interval from $\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$ to $\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$ that does not cross the vertical line at µ]
This interval from $\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$ to $\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$ is called a 95% confidence interval. This interval is a range of plausible values for the population mean that may enclose the population mean. It will enclose the population mean for 95% of the samples.

The Confidence Interval Formula
First Draft of the Formula:
When dealing with a random sample of size 30 or greater, the confidence interval based on a sample mean is calculated using the following formula.
$\bar{x} \pm z\frac{\sigma}{\sqrt{n}}$
If we are calculating a 90% confidence interval then z equals 1.645.
If we are calculating a 95% confidence interval then z equals 1.96.
If we are calculating a 99% confidence interval then z equals 2.56.
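As a rough illustration of this first-draft formula, the sketch below applies it to Sample 1 from the Simulated Sampling activity (sample mean 412.3, σ = 38, n = 40) at the 95% level. The function name `confidence_interval` and the lookup table `Z_VALUES` are labels made up for this example; they are not part of the unit. The z values are the ones quoted in this unit; note that standard normal tables give roughly 2.576 for 99%, so results at that level may differ slightly from other references.

```python
import math

# z values as quoted in this unit for each confidence level (in percent).
Z_VALUES = {90: 1.645, 95: 1.96, 99: 2.56}


def confidence_interval(sample_mean, sigma, n, level=95):
    """Return (lower, upper) for sample_mean ± z * sigma / sqrt(n)."""
    if n < 30:
        raise ValueError("the formula in this unit assumes a sample size of 30 or more")
    z = Z_VALUES[level]
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Sample 1 from the Simulated Sampling activity: mean 412.3, sigma 38, n 40.
low, high = confidence_interval(412.3, 38, 40, level=95)
print(f"From {low:.2f} to {high:.2f}")  # prints "From 400.52 to 424.08"
```

In this case the interval encloses the known population mean of 412 µg/m³, which is what we would expect for most samples at a 95% confidence level.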
This confidence interval formula requires that we know the population standard deviation (σ). In the real world we use samples (and their resulting confidence intervals) to make inferences about an unknown population. If it is an unknown population, then we will not know the population standard deviation (σ). We need another approach. If the sample is large (n ≥ 30), then we can replace the population standard deviation with the sample standard deviation.

Second (and Final) Draft of the Formula:
When dealing with a random sample of size 30 or greater, the confidence interval based on a sample mean is calculated using the following formula.
$\bar{x} \pm z\frac{S_x}{\sqrt{n}}$
If we are calculating a 90% confidence interval then z equals 1.645.
If we are calculating a 95% confidence interval then z equals 1.96.
If we are calculating a 99% confidence interval then z equals 2.56.
This second and final draft of the confidence interval formula is the one we will use.

Example 3
Samir conducted a study where he examined the concentration of a particular airborne contaminant in 250 randomly selected households from across Canada. The sample mean and sample standard deviation were 413.2 µg/m³ and 40.5 µg/m³ respectively.
(a) Determine the 90% confidence interval.
(b) In question (a), did we generate an interval estimator or a point estimate?
(c) Explain what this confidence interval means.
(d) After completing his study, Samir learns that the federal government had tested every household for this particular airborne contaminant and found the population mean was 412 µg/m³ with a population standard deviation of 38 µg/m³. Does the interval derived from Samir’s sample enclose the population mean?
(e) If he collected 200 samples of size 40 and worked out 200 confidence intervals each with a 90% confidence level, how many would we expect to enclose the population mean?
(f) Suppose Samir took a sample of size 400 and the resulting mean and standard deviation were 411.7 µg/m³ and 36.4 µg/m³ respectively. Determine the 99% confidence interval and state whether it encloses the population mean.

Answers:
(a) $\bar{x} \pm z\frac{S_x}{\sqrt{n}}$
$413.2 \pm 1.645\left(\frac{40.5}{\sqrt{250}}\right)$
$413.2 \pm 4.21$
From 408.99 to 417.41
(b) A confidence interval is an interval estimator.
(c) We are stating that the method used to construct the interval from 408.99 µg/m³ to 417.41 µg/m³ has a 0.9 probability of enclosing the true mean air contaminant level for all households in Canada (i.e. population mean). There is a 0.1 probability that this method does not create an interval that encloses the population mean.
(d) We were told that the population mean is 412 µg/m³. The interval from 408.99 µg/m³ to 417.41 µg/m³ encloses the population mean.
(e) 90% of 200 = 180
We would expect that 180 out of the 200 confidence intervals would enclose the population mean.
(f) $\bar{x} \pm z\frac{S_x}{\sqrt{n}}$
$411.7 \pm 2.56\left(\frac{36.4}{\sqrt{400}}\right)$
$411.7 \pm 4.66$
From 407.04 to 416.36
This interval encloses the population mean (412 µg/m³).

Example 4
Jamie and Angela each conduct a study where they record the weights of randomly selected 10 year old males. The weights in pounds for these two samples are recorded below.
Jamie’s Sample:
81.2 110.7 101.4 112.7 112.7 104.8 113.7 91.7 102.9 116.0
107.0 109.9 107.1 85.6 83.1 99.5 85.4 113.2 97.7 95.4
114.6 116.0 101.8 111.1 102.3 108.3 83.8 97.3 112.5 85.6
Angela’s Sample:
90.7 100.5 106.7 87.6 91.5 90.1 102.6 85.4 106.9 88.9 114.3 71.4
122.1 98.2 108.8 84.9 106.2 100.3 115.8 91.5 86.1 109.5 95.9 101.7
95.6 84.7 75.8 96.3 104.8 98.7 80.4 103.0 120.1 84.8 110.2 118.7
(a) Determine the 95% confidence interval for Jamie’s sample.
(b) Explain what the confidence interval from (a) means.
(c) Determine the 95% confidence interval for Angela’s sample.
(d) Which confidence interval has a greater probability of enclosing the population mean?
(e) Does either of the confidence intervals enclose the population mean?

Answers:
In this question, we cannot calculate either of the confidence intervals without the sample means ($\bar{x}$) and sample standard deviations ($S_x$) for the two samples. We will enter the data points in our TI-83 or TI-84 calculators and use the 1-Var Stats command to obtain the desired means and standard deviations.
(a) $\bar{x} \pm z\frac{S_x}{\sqrt{n}}$
$102.2 \pm 1.96\left(\frac{11.1}{\sqrt{30}}\right)$
$102.2 \pm 3.97$
From 98.23 to 106.17
(b) It means that the method used to produce the interval from 98.23 pounds to 106.17 pounds has a 0.95 probability of enclosing the true mean weight of 10 year old males (i.e. population mean). There is a 0.05 probability that this method creates an interval that does not enclose the population mean.
(c) $\bar{x} \pm z\frac{S_x}{\sqrt{n}}$
$98.1 \pm 1.96\left(\frac{12.6}{\sqrt{36}}\right)$
$98.1 \pm 4.12$
From 93.98 to 102.22
(d) This is a trick question. These confidence intervals have the same confidence level (95%); therefore the methods used to create both intervals have the same probability of enclosing the population mean.
(e) We cannot tell if either of these confidence intervals encloses the population mean because the population mean is not supplied. We are dealing with an unknown population.

If you wish further clarification, go to the Mathematics Multimedia Learning Objects (see page iv), access Unit 11-5 Statistics, and view MLO15 Using Confidence Intervals.

Questions
1. Brian wants to know how much on average Nova Scotian households spend on electricity in the month of December. He could not get permission from the power corporation to access their records for that month so he decided to collect a random sample of size 300. After analyzing the data, he reports that the population mean power bill for Nova Scotian households is between $292 and $304 with 95% confidence. Explain what this last sentence means.

2. Barb collects a sample of size 98 from an unknown population. She calculates the sample mean and finds that it is equal to 583.2.
The sample standard deviation works out to be 32.1.
(a) Determine the 99% confidence interval based on this sample.
(b) Explain what this confidence interval means.
(c) Does the confidence interval enclose the population mean?
(d) If we collected 500 samples of the same size from the same population and then generated five hundred 99% confidence intervals, how many would one expect to enclose the population mean?

3. Dr. Saad conducted a medical study where he recorded the resting heart rate of 32 randomly selected 18 year old girls. The data in beats per minute is supplied below.
79 71 78 71 76 70 69 66 76 84 77 67 78 78 87 75
65 72 68 72 77 73 70 72 72 81 84 82 66 89 76 72
(a) Determine $\bar{x}$. Is it a point estimate or interval estimator?
(b) Calculate the 95% confidence interval. Is it a point estimate or interval estimator?
(c) When Dr. Saad is asked to explain the meaning of the resulting 95% confidence interval he responds, “There is a 0.95 probability that the true mean resting heart rate of 18 year old girls falls within the interval we just calculated.” Is his interpretation correct? Explain.
(d) If he collected 400 samples of size 32 and created four hundred confidence intervals with the same confidence level as above, how many would one expect not to enclose the true mean resting heart rate? Would he be able to determine which intervals did not enclose the population mean?

4. Monica and Kadeer conducted two separate studies that looked at the daily water consumption of randomly selected adult Nova Scotians. The data reported in litres is listed below.
Monica’s Data:
360 366 300 313 223 348 343 299 340 330 317
303 254 335 345 368 402 362 306 281 405 321
366 303 393 289 339 444 377 299 306 285 429
Kadeer’s Data:
363 297 271 303 300 330 351 322 311 305 319 359
383 388 321 338 220 271 364 350 309 299 323 320
375 304 308 361 354 359 341 302 390 307 290 325
(a) Determine the 90% confidence interval based on Monica’s data.
(b) Determine the 99% confidence interval based on Kadeer’s data.
(c) Which method used to create the two confidence intervals has a greater probability of enclosing the true mean daily water consumption?

5. Maurita collects a sample of size 56 from an unknown population. The sample mean works out to be 148.0 and the sample standard deviation works out to be 17.4.
(a) Determine the 90% confidence interval for this sample.
(b) Determine the 95% confidence interval for this sample.
(c) Determine the 99% confidence interval for this sample.
(d) How does the confidence level affect the confidence interval?
(e) If µ = 143.9, did all three confidence intervals enclose the desired value?

6. Rana collects three samples of differing sizes from the same unknown population. We have “cooked” the results so that the sample standard deviations remain the same for the three samples. The reason for this will become apparent as you progress through the questions.
(a) Determine the 95% confidence interval for a sample of size 30 with a sample mean of 53.8 and sample standard deviation 4.89.
(b) Determine the 95% confidence interval for a sample of size 100 with a sample mean of 54.9 and sample standard deviation 4.89.
(c) Determine the 95% confidence interval for a sample of size 250 with a sample mean of 54.3 and sample standard deviation 4.89.
(d) Does the sample size affect the width of the confidence interval? Explain.

7. Which of these factors affect the width of confidence intervals? Simply indicate with a check mark.
____ Population Mean
____ Sample Size
____ Confidence Level
____ Sample Mean

8. Indicate whether each of the following statements is true or false.
_________ (a) Once we calculate a confidence interval, the population mean may or may not be enclosed by that interval.
_________ (b) There is a 95% chance that a 95% confidence interval will include the sample mean.
_________ (c) A sample mean is an example of a point estimate.
_________ (d) If we are sampling from the same population and using the same sample size, then higher confidence levels produce wider intervals than lower confidence levels.
_________ (e) If we are sampling from the same population and constructing confidence intervals with the same confidence levels, then larger sample sizes produce wider intervals than those from smaller sample sizes.
_________ (f) For a 99% confidence interval, there is a 0.99 probability that the population mean will fall between the two values.
_________ (g) Approximately 90% of the data points in a sample are enclosed within the 90% confidence interval.

9. Water from 70 different rainfalls in Nova Scotia was analyzed for acidity (pH). The mean pH reading was 6.2 with a standard deviation of 0.5. Determine the 95% confidence interval for the mean acidity and explain what the interval represents.

10. Go to the following website.
http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/
(or Google Search: Confidence Interval Applet RVLS)
Read the instructions and then click on the BEGIN icon. The window shown on the right will appear on the screen. Press SAMPLE and examine the diagram on the left and the chart at the bottom of the window. Press the SAMPLE button again. What is this applet trying to show you?

11. A large national department store chain that offers extended warranties on its products wants to know how long a particular brand of washing machine will last before needing maintenance. They randomly selected customers who purchased this machine and asked them how long their machine lasted before requiring maintenance. The data reported in months is listed below.
56 47 45 50 42 51 49 41 49 49
46 44 45 49 46 51 45 50 46 46
44 45 41 52 51 51 44 56 54 51
55 44 45 49 48 43 49 50 45 46
(a) Calculate the 90% confidence interval.
(b) Did the interval enclose the population mean? Explain.
(c) If you collected another sample of size 40, would you expect the confidence interval to change? Explain.
(d) If the confidence level is changed from 90% to 99%, how would that affect the width of the confidence interval?
(e) If the sample size was changed from 40 to 100, how would that affect the width of the 90% confidence interval?

Putting It Together

Before we start working on review questions, we should look at the various sections of this unit and the terms that were introduced in each of those sections.
Introductory Material and Terminology - Descriptive Statistics, Inferential Statistics, Population, Sample, Categorical Data, Discrete Numerical Data, Continuous Numerical Data
Bar Graphs and Histograms - Bar Graphs, Histograms, Distributions (Uniform, Skewed, Normal, Bimodal)
Describing Data, Part 1 - Population Mean, Sample Mean, Median, Outliers, Trimmed Means
Describing Data, Part 2 - Population Standard Deviation, Sample Standard Deviation
Normal Distribution - 68-95-99.7 Rule
Collecting a Sample - Unbiased Sample, Sample Size
Sampling Methods - Simple Random Sample, Cluster Sample, Stratified Sample, Systematic Sample, Voluntary Response Sample, Convenience Sample
Central Limit Theorem - Sampling Distribution of the Sample Means, Mean of the Sample Means, Standard Deviation of the Sample Means
Point Estimates and Interval Estimators - Point Estimate, Interval Estimators, Confidence Interval, Confidence Level

Questions:
1. The manager of the community sportsplex wanted to know how the 1386 members might feel about the discussion concerning an addition to the existing building that included a 25 metre, 8 lane pool. He asked 230 randomly selected members if they were willing to pay an additional $35 a year on their membership fee to have these new features. Describe the population and the sample for this situation.

2.
For each of the following, state whether the data collection would result in a categorical data set or numerical data set. If the data is numerical, indicate whether we are dealing with discrete or continuous data. (a) The number of pets in Nova Scotian households (b) The type of MP3 player owned by adults. (c) The diameter of the trunk of spruce trees growing in a particular valley. (d) The size of T-shirts worn by boys between the ages of 16 and 18 years (e) The number of children traveling more than 1.5 kilometres to school. (f) The time to complete a driver’s license renewal at a specific Access Nova Scotia location 3. If you were collecting a random sample in each situation, what type of distribution (normal, uniform, bimodal, skewed) would you likely obtain? Distribution Type (a) Hodgkin’s lymphoma is a type of cancer that originates from white blood cells. This disease typically affects people either in early adulthood or when they are 55 years of age or older. You randomly select 250 patients with Hodgkin’s lymphoma and ask them to report the age of their initial diagnosis. What would the distribution of ages likely look like? (b) Most people make under $40,000 a year, but some make quite a bit more, with a smaller number making many millions of dollars a year. What would the distribution of yearly earnings likely look like? (c) James is working as a biologist for the summer and measuring the circumferences of randomly selected maple trees in a natural growth forest. What would the distribution of circumferences likely look like? (d) You use the random number generator on your calculator to find 500 random whole numbers between 1 and 10. What would the distribution of numbers likely look like? 4. An airline company randomly selected eighteen suitcases from domestic flights and recorded their weights in kilograms. 
16.2 11.3 15.7 14.7 15.1 19.6 16.0 14.1 3.9
18.0 14.8 16.3 13.6 11.9 12.4 14.8 13.5 19.7
(a) Although the airline collected a sample, describe the population in this situation.
(b) Would a histogram or bar graph be used with this data set?
(c) Calculate the mean, median, and 5% trimmed mean without using the STAT feature on a TI-83/84 calculator.
(d) Which of these measures is not influenced or less influenced by extremely high or low data points?

5. A study looked at the concentration of iron in the bloodstream of ten randomly selected high performance female athletes. The following data was collected. The concentrations are measured in grams per decilitre (g/dl).
15.3 14.2 13.6 11.9 14.8 12.6 14.6 13.9 14.2 12.9
(a) Are we dealing with a population or a sample?
(b) Calculate the mean without using the STAT features on your calculator. Use the appropriate symbol.
(c) Calculate the standard deviation without using the STAT features on your calculator.

6. The body mass index of 600 randomly selected 20 year old males was taken. The sample mean was 23.0 kg/m² and the sample standard deviation 2.5 kg/m². Assume that the distribution of body mass indexes was bell shaped.
(a) Approximately how many 20 year old males had body mass indexes between 23.0 kg/m² and 25.5 kg/m²?
(b) Approximately how many 20 year old males had body mass indexes between 18.0 kg/m² and 23.0 kg/m²?
(c) Approximately how many 20 year old males had body mass indexes between 15.5 kg/m² and 30.5 kg/m²?
(d) Approximately how many 20 year old males had body mass indexes between 20.5 kg/m² and 28.0 kg/m²?
(e) Approximately how many 20 year old males had body mass indexes between 18.0 kg/m² and 30.5 kg/m²?
(f) Approximately how many 20 year old males had body mass indexes between 15.5 kg/m² and 25.5 kg/m²?
(g) Approximately how many 20 year old males had body mass indexes between 25.5 kg/m² and 28.0 kg/m²?
(h) Approximately how many 20 year old males had body mass indexes between 15.5 kg/m² and 18.0 kg/m²?

7. Identify the sampling method used. Also indicate whether we are dealing with a preferred or poor sampling method.
(a) A cable company wanted to know how its customers felt about upgrading the high definition (HD) television signal from 720p to 1080p. There would be a small increase in the monthly bill for this upgrade. As customers have signed up for the regular HD television (720p), they have been assigned a six-digit customer identification number starting at 000000. The company wants to ask every hundredth HD customer about the potential upgrade. They randomly generate the number 83, and then ask every customer whose identification number ends with these two digits to respond to the company’s survey.
Method: ______________________________________________
(b) A small community has 1000 adult residents. The community leaders want to know how the residents feel about the rezoning of some municipal property so that a small strip mall can be built. The leaders want to collect a sample of size 120. Each resident is assigned a number 000 through to 999. They take the ball machine from the local bingo hall and fill it with three sets of ping pong balls, each set numbered 0 through 9. The machine is turned on and three balls are extracted. Those three numbers correspond to the three-digit number assigned to one resident. The three balls are returned to the machine and this process is repeated 119 more times. The leaders now know which of their residents will be asked to partake in the survey.
Method: ______________________________________________
(c) A national hardware store wants to know how its customers feel about the service, products, and prices. At the bottom of every receipt, they include a website. If the customer chooses to visit the site, they can answer a series of questions and have an opportunity to win a prize.
Method: ______________________________________________
(d) A large hotel in a large city wants to know how much its customers spend at their hotel during an overnight visit. The hotel already knows that 80% of its customers are there on business while only 20% are there for leisure. They suspect that the spending habits for these two groups may be quite different so they create a random sampling technique that ensures that both groups are proportionally represented in the survey.
Method: ______________________________________________
(e) Ontario has 211 hospitals. The health authority wants to understand the demands that are presently being put on emergency room staff. Rather than interviewing every ER staff member at every hospital, they randomly select 10 hospitals and interview every ER staff member at those ten facilities.
Method: ______________________________________________

8. Random samples of size 75 are repeatedly selected from a known population with a mean of 107.2 and a standard deviation of 9.3. These repeatedly collected samples allow a sampling distribution of all possible x̄’s to be drawn.
(a) What type of distribution (uniform, bimodal, normal, or skewed) would result?
(b) Determine the mean of the sample means and indicate where it would be located on the distribution of the x̄’s.
(c) Determine the standard deviation of the sample means.

9. The mean cost of a lunch at a particular eating establishment is $11.52 with a standard deviation of $1.47. A sample is taken from this known population. The data points are shown below in the chart.
10.63 9.58 12.43 10.66 11.85 11.64 11.58 11.52 9.12 10.12
14.05 11.09 12.20 13.00 11.22 10.05 11.39 8.77 13.11 12.46
12.50 11.21 10.41 9.76 12.99 13.03 14.15 11.06 12.27 11.42
13.73 14.06 14.10 9.74 12.38 11.21 12.80
(a) What is the population mean?
(b) What is the population standard deviation?
(c) Determine the sample mean.
(d) What does the sample mean represent in this situation and is it close to the expected value?
(e) What is the sample size?
(f) Determine the sample standard deviation.
(g) If samples of the same size are repeatedly collected from this known population, what would be the value of the standard deviation of the sample means?
(h) If samples of the same size are repeatedly collected from this known population, what would be the value of the mean of the sample means?
(i) For the sampling distribution of x̄’s, between what two values would one expect 68% of the sample means to fall?

10. (a) Does the sample size affect the mean of the sample means? Explain.
(b) Does the sample size affect the standard deviation of the sample means? Explain.

11. Meera collects a sample of size 125 from an unknown population. She calculates the sample mean and finds that it is equal to 287.1. The sample standard deviation works out to be 25.7.
(a) Determine the 90% confidence interval based on this sample.
(b) Explain what this confidence interval means.
(c) If we collected 500 samples of the same size from the same population and then generated five hundred 90% confidence intervals, how many would one expect to enclose the population mean?

12. Dr. Bagnell conducted a medical study where she recorded the height in centimetres of 36 randomly selected 20 year old males. The data is supplied below.
177 177 179 176 182 176 192 172 183 171 185 184
192 180 178 174 184 167 172 179 179 184 171 184
182 176 178 177 172 172 178 173 180 181 173 179
(a) Determine x̄. Is it a point estimate or interval estimator?
(b) Calculate the 95% confidence interval. Is it a point estimate or interval estimator?
(c) Calculate the 99% confidence interval.
(d) Which of the methods used to create the two confidence intervals has a greater chance of enclosing the true mean height of 20 year old males? Explain.

13.
The head circumferences of 150 randomly selected infants (20 months of age) were recorded. The mean circumference reading was 48.1 centimetres with a standard deviation of 1.2 centimetres. Determine the 95% confidence interval and explain what the interval represents. Does the interval enclose the true mean head circumference?

14. Computer equipment can be sensitive to high temperatures. Leck Electronics wanted to test a particular computer component to determine at what temperature the component would fail. They randomly selected 35 of the same component, exposed them to increasing temperatures, and recorded the temperature (°C) at which the component failed.
35.2 30.9 35.6 34.1 31.8 33.7 33.5
30.2 33.4 38.4 33.0 33.1 30.6 33.1
31.5 34.1 28.0 31.5 29.7 29.2 34.1
28.1 33.3 33.5 33.7 30.6 38.4 36.9
33.2 32.8 37.4 36.9 34.5 33.3 33.8
(a) Calculate the 90% confidence interval for the mean failure temperature.
(b) If you collected another sample of size 35, would you expect the confidence interval to change? Explain.
(c) If 300 random samples of size 35 were obtained and three hundred 90% confidence intervals were constructed, approximately how many would you expect not to enclose the population mean?
(d) If the sample size were changed from 35 to 200, how would that affect the width of the 90% confidence interval?
(e) If the confidence level were changed from 90% to 95%, how would that affect the width of the confidence interval?

If You Have the Time

We have spent the last few weeks looking at descriptive and inferential statistics. Although we examined several statistical tools in real world applications, we have not seen how statistical information can dramatically change our understanding of the world. There are, however, two fascinating online videos that do just that. The first video features Hans Rosling, a Swedish physician and professor of International Health.
He uses statistics to show how we must change our perceptions of other countries, particularly those that we deem as third world. He even uses confidence intervals in his presentation to show that Swedish undergraduate university students performed worse than chimpanzees on a test of international child mortality rates. This video can be viewed at the following site.

TED Hans Rosling Shows the Best Stats You've Ever Seen
http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html

The second video features Peter Donnelly, an Australian statistician working at the University of Oxford. He presents several real world examples where people, including professionals, have difficulty reasoning with uncertainty, and he discusses the implications of such shortcomings.

TED Peter Donnelly Shows How Stats Fool Juries
http://www.ted.com/talks/lang/eng/peter_donnelly_shows_how_stats_fool_juries.html

Optional Assignment
With your instructor’s permission, you may wish to negotiate an optional assignment based on one or both of these videos.

Post-Unit Reflections

What is the most valuable or important thing you learned in this unit?
What part did you find most interesting or enjoyable?
What was the most challenging part, and how did you respond to this challenge?
How did you feel about this topic when you started this unit?
How do you feel about this topic now?
Of the skills you used in this unit, which is your strongest skill?
What skill(s) do you feel you need to improve, and how will you improve them?
How does what you learned in this unit fit with your personal goals?

Terms, Symbols, and Formulas

By the end of this unit, you should be familiar with the following terms, symbols, and formulas. These have been presented in the order that they appear in this resource.
Descriptive Statistics
Inferential Statistics
Population
Sample
Categorical Data Set
Numerical Data Set
Discrete Numerical Data
Continuous Numerical Data
Bar Graph
Histogram
Normal Distribution
Uniform Distribution
Skewed Distribution
Bimodal Distribution
Population Mean, $\mu$: $\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
Sample Mean, $\bar{x}$: $\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
Median
Trimmed Mean, $\bar{x}_{(T)}$
Population Standard Deviation, $\sigma$: $\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \dots + (x_n - \mu)^2}{n}}$
Sample Standard Deviation, $S_x$: $S_x = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}}$
Frequency Polygon
68-95-99.7 Rule
Sample Size, $n$
Simple Random Sample
Stratified Random Sample
Cluster Sample
Systematic Sample
Convenience Sample
Voluntary Sample
Sampling Distribution of the Sample Means
Mean of the Sample Means, $\mu_{\bar{x}}$: $\mu_{\bar{x}} = \mu$
Standard Deviation of the Sample Means, $\sigma_{\bar{x}}$: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Central Limit Theorem
Point Estimate
Interval Estimator
Confidence Interval Based on a Sample Mean: $\bar{x} \pm z \frac{S_x}{\sqrt{n}}$
Confidence Level

TI-83/84 Statistics Information Sheet

The following commands are used throughout this unit.

1. The 1-Var Stats Command
This command allows one to determine the mean (x̄), median, sample standard deviation (Sx), and population standard deviation (σ) for data entered into a list on the calculator. The command also generates other values but none of these will be used in this course.
STAT > CALC > 1-Var Stats > Enter the list name (L1, L2,…) > ENTER

2. The SortA Command
This command sorts data in a specific list in ascending order (i.e. smallest to largest).
STAT > SortA( > Enter the list name (L1, L2,…) > ENTER

3. The rand Command
This command generates a random number between 0 and 1.
MATH > PRB > rand > In a set of brackets, indicate the number of random numbers you wish to generate.

4.
The randNorm Command
This command allows one to simulate the collection of a random sample of a specific size from a known population that is normally distributed.
MATH > PRB > randNorm > Enter the population mean, population standard deviation, and sample size, all separated by commas. Close the brackets.

5. The mean Command
This command finds the mean for data in a specific list.
LIST > MATH > mean( > Enter the list name (L1, L2,…) > ENTER

6. The seq Command
This command generates a sequence of numbers.
LIST > OPS > seq(

Answers

Introductory Materials and Terminology (pages 1 to 5)
1. Population: all the taxpayers in this community (4127)
Sample: the 300 randomly selected taxpayers
2. Population: all the used bricks that the contractor purchased (6000)
Sample: the 200 randomly selected bricks that were examined to determine usability
3. Population: all of the employed workers in Nova Scotia (453 000)
Sample: the 1200 randomly selected employed workers who participated in the survey and reported their annual gross income
4. Population: all of the adults who received a high school diploma from NSSAL between 2001 and 2009
Sample: the 240 randomly selected NSSAL graduates who participated in the interview
5. (a) numerical (continuous) (b) categorical
(c) categorical (d) numerical (continuous)
(e) numerical (discrete) (f) numerical (continuous)
(g) categorical (h) numerical (continuous)
(i) numerical (discrete)
6. (a) Quebec City, 340 cm per year
(b) 50 cm per year
(c) numerical data set: cities are reporting annual snowfalls in centimetres
7. (a) 187 people
(b) 231 people
(c) 176 people
(d) sample (reason: only 1100 of all the citizens were selected)
(e) categorical data set
8. (a) 58%
(b) 14%
(c) 1991 and 1996
(d) population (reason: census)

Bar Graphs and Histograms (pages 6 to 10)
1.
(a) bar graph (b) histogram
(c) histogram (d) bar graph
(e) bar graph (f) histogram
(g) bar graph
2. (a)
(b) 16 2/3 %
(c) skewed left
(d) Because we are dealing with continuous numerical data
(e) sample
3. (a) population
(b)
(c) normal
4. (a) uniform (b) bimodal
(c) skewed right (d) normal
(e) skewed left

Describing Data, Part 1 (pages 11 to 16)
1. (a) sample
(b) x̄ = 6.2, Median = 6
(c) There are no outliers.
2. (a) population
(b) numerical
(c) µ = 159.44, Median = 157
3. (a) sample
(b) x̄ = 35 (34.6), Median = 31
5% Trimmed Mean: x̄(T) = 31 (30.6)
10% Trimmed Mean: x̄(T) = 31 (30.9)
(c) Trimmed means are appropriate because the outlier 115 exists within the data set.
(d) Four data points from the bottom and four data points from the top of the data set
4. (a) x̄ = 268 (267.875), Median = 254 (253.5), x̄(T) = 255 (255.409)
(b) Median and Trimmed Mean
(c) Histogram
5. (a) This score system was likely implemented to eliminate the effect of a single rogue judge who would inflate or deflate the score of a particular athlete.
(b) The method used in gymnastics and diving removes only one high score and one low score. If more than one judge works together to inflate or deflate the score of a particular athlete, then this particular trimmed mean technique will eliminate only one rogue judge, but not all. In the case of this figure skating competition, we were dealing with more than one rogue judge.

Describing Data, Part 2 (pages 17 to 24)
1.
xi     xi − x̄     (xi − x̄)²
25     -3         9
32     4          16
24     -4         16
28     0          0
31     3          9
28     0          0
Sum = 50
$S_x = \sqrt{\frac{50}{6 - 1}} = 3.16$
2.
xi     xi − µ     (xi − µ)²
3.7    -0.6       0.36
4.3    0          0
5.0    0.7        0.49
4.6    0.3        0.09
4.0    -0.3       0.09
4.7    0.4        0.16
3.9    -0.4       0.16
4.2    -0.1       0.01
Sum = 1.36
$\sigma = \sqrt{\frac{1.36}{8}} = 0.41$
3. (a) First Data Set: x̄ = 15, Sx = 1.58
Second Data Set: x̄ = 15, Sx = 2.65
(b) Although the sample means are equal, the sample standard deviations are different.
Since the standard deviation is lower for the first data set, we know that the individual data points are more clustered around the mean compared to the values in the second data set.
4. (a) 543
(b) 544
(c) Sx = 5.24
(d) Although the means and standard deviations for the two samples would be similar, they would likely not be the same. Because samples are a subset of the population, it is very unlikely that the two samples would draw the same individual pieces of data.
5. (a) 183
(b) 182
(c) numerical data set
(d) σ = 4.90
(e) The average heights of these two groups of learners are the same; however, the standard deviation for Barb’s group is much lower. That means that there is less variation in heights between Barb’s male learners compared to the other instructor’s learners. The heights of her learners are more clustered around the mean.
(f) The standard deviations are almost the same for the two groups of male learners; however, the mean height for Barb’s group is higher. We can conclude that the average height of male learners in Barb’s math courses is three centimetres more than the third instructor’s male students. The variation in heights between the two groups is essentially the same.
6. Answers will vary.
7. Histogram (i) matches with (c)
Histogram (ii) matches with (b)
Histogram (iii) matches with (d)
Histogram (iv) matches with (a)

Using Technology (pages 25 to 28)
1. (a) sample
(b)
(c) x̄ = 155.6, median: 156, Sx = 18.3
(d) normal distribution
2. (a) population
(b)
(c) µ = 55.6
(d) σ = 9.5
(e) median: 54.5
(f) The data does not cluster well around the mean.
3. (a) population
(b)
(c) µ = 14.1, median: 9.91, σ = 11.2
(d) The mean is high because the incarceration rate for the Northwest Territories is so much higher than the other rates.

Normal Distribution (pages 29 to 34)
1.
The first data point in List 5 represents the lowest total honey production over a four year period for one of the one hundred randomly selected hives.
2. The last data point in List 5 represents the highest total honey production over a four year period for one of the one hundred randomly selected hives.
3. The sample mean, x̄, represents the average total honey production over the four year period of the one hundred randomly selected hives.
4. (a) Answers will vary.
(b) Answers will vary but there should be around 68 (give or take 3 or 4).
(c) 68%, it should be supported because you should have about 68 out of 100 data points within this range.
5. (a) Answers will vary.
(b) Answers will vary but there should be around 95 (give or take 3 or 4).
(c) 95%, it should be supported because you should have about 95 out of 100 data points within this range.

Using the 68-95-99.7 Rule (pages 35 to 39)
1.
      Hint                               Calculation          Answer
(a)   Between µ − σ and µ + σ            --                   68%
(b)   Between µ and µ + 2σ               --                   47.5%
(c)   Between µ − σ and µ                --                   34%
(d)   Between µ − 3σ and µ               --                   49.85%
(e)   Between µ − 2σ and µ + σ           47.5% + 34%          81.5%
(f)   Between µ − σ and µ + 3σ           34% + 49.85%         83.85%
(g)   Between µ − 3σ and µ + 2σ          47.5% + 49.85%       97.35%
(h)   Between µ + σ and µ + 2σ           47.5% - 34%          13.5%
(i)   Between µ − 3σ and µ − 2σ          49.85% - 47.5%       2.35%
(j)   Between µ + σ and µ + 3σ           49.85% - 34%         15.85%
2.
      Hint                               Calculation          Percentage   Answer
(a)   Between x̄ − 3Sx and x̄ + 3Sx        --                   99.7%        1994
(b)   Between x̄ − Sx and x̄ + Sx          --                   68%          1360
(c)   Between x̄ − 2Sx and x̄              --                   47.5%        950
(d)   Between x̄ − Sx and x̄ + 2Sx         34% + 47.5%          81.5%        1630
(e)   Between x̄ + Sx and x̄ + 2Sx         47.5% - 34%          13.5%        270
(f)   Between x̄ − 2Sx and x̄ + 2Sx        --                   95%          1900
(g)   Between x̄ − 3Sx and x̄ − Sx         49.85% - 34%         15.85%       317
(h)   Between x̄ − 2Sx and x̄ + 3Sx        47.5% + 49.85%       97.35%       1947
(i)   Between x̄ − 3Sx and x̄              --                   49.85%       997
(j)   Between x̄ + 2Sx and x̄ + 3Sx        49.85% - 47.5%       2.35%        47

Collecting a Sample (pages 41 to 44)
Conclusions (a) and (b)
Although it is not guaranteed, most learners’ non-random samples will not be a good representation of the population. Generally students will choose a few small buildings (between 1000 and 4000 sq. ft.), a few medium sized buildings (between 6000 and 8000 sq. ft.) and at least one large building (between 9000 and 16000 sq. ft.). The population, however, has a much greater proportion of smaller buildings than medium or large buildings. The random samples are more likely to capture this and therefore be better representations of the population. The purpose of this investigation was to show that when we conduct surveys we should be using random sampling techniques to ensure that we end up with unbiased samples.
1. (a) Conducting the survey at an ultimate fighting competition is problematic. This type of competition is extremely physical and some would say violent. Asking viewers their opinions on media violence will not likely produce data that is representative of the general population.
(b) There are two problems with Genevieve’s survey. The first is the location. Shopping malls do not serve the needs of all shoppers. Low income families will likely use other shopping establishments. High income individuals may shop predominantly at specialty stores or boutiques. There is also the issue that Mic Mac Mall serves predominantly an urban, rather than rural, clientele. The second problem lies in the manner she selected survey participants. They were not randomly selected. She approached people who she felt would answer her survey questions.
She may inadvertently omit individuals from differing age groups, cultures, or socioeconomic groups.
(c) Not everyone who views this show and has an opinion regarding the talent of the contestants will participate in the voting. It is often difficult to register a vote through all the busy circuit signals; therefore only individuals who have a strong opinion regarding the competition are likely to vote. Many of these will vote more than once. The other matter is the cost. In some cases, the individuals have to incur long distance phone charges. This may deter some low income individuals from participating in the vote.
(d) This survey technique has similar problems to the survey discussed in (b): location and selection of participants. Conducting a survey on gun registration at a hardware store is problematic. The store likely deals with more male clientele than female. In addition to this, the store likely sells firearms and therefore likely attracts a greater proportion of hunters and firearms enthusiasts than other establishments. Participants are not randomly selected for this survey; rather they volunteer to respond. Individuals who have strong opinions on the matter are likely to respond and they may respond more than once.
(e) The problem is not the sampling technique; rather it is the question itself. The question is “loaded” in that it initially presents negative aspects of war in Afghanistan and then asks the question whether Canadian soldiers should remain in the conflict. The solution is not to include the positive aspects related to Canada’s involvement in the war, but rather to create a question that does not identify positive or negative aspects. The question should simply be, “Should Canadian soldiers remain in Afghanistan?”

Sampling Methods (pages 45 to 49)
1. (a) systematic (b) volunteer response
(c) cluster (d) convenience
(e) simple random (f) volunteer response
(g) stratified (h) cluster
(i) volunteer response (j) simple random
(k) stratified
2. (b) Asra’s cafeteria survey
(d) Montez’s Gas Station Survey
(f) Reality Show Online Voting
(i) Ranelda going to TripAdvisor.com
3. Answers will vary slightly.
(a) Place the numbers 80000 through 82500 on separate pieces of paper, place the pieces of paper in a drum, stir the drum, and draw 500 pieces of paper.
(b) Assign the numbers 1 through 44 to each of the high schools. Randomly select five numbers between 1 and 44. Review all of the math exams from the five schools with those assigned numbers.
(c) Look at the enrollment of grade 12 math students in each of the 44 schools. Randomly select 500 exams in such a manner that each school is proportionally represented.
(d) Randomly select a number between 0 and 4. If, for example, the number 3 is obtained, then every exam whose identification number ends with this digit would be selected for review.

Simulated Sampling (pages 50 to 52)
Sample 1
The first data point in the table represents the airborne contaminant level measured in µg/m² for the first randomly selected Canadian household in the first sample of size 40.
Sample 2
It is highly unlikely that the first five and last five data points in the two tables are going to match because we are dealing with different samples. For example, it is unlikely that the first randomly selected household from the millions across Canada in the first sample would have the same airborne contaminant level as the first randomly selected household in the second sample.
1. The mean for Sample 1 represents the average contaminant level in µg/m² for the 40 randomly selected households.
2. The sample means differ because they are from four different samples. Each sample contains different data points and therefore likely results in a different mean.
3. The expected value is the population mean (µ = 412).
The sample means should be fairly close to the population mean.
4. Statement (d) is correct.
(d) The population mean is fixed and the sample mean is random.
Explanation: If you consider our simulations, the sample means differed, hence they are random, while our population mean remained fixed at 412.
5. No

Sampling Distribution of the Sample Means (pages 53 to 57)
(i) The calculator must generate 40 random numbers from the specified normal population, work out the mean for those 40 data points, store that piece of information in List 1, and then repeat that procedure 99 more times. This is obviously a time consuming process even for a calculator.
(ii) Of the 100 simulated samples, the first value in the table represents the smallest sample mean obtained from our 100 random samples. These 40 randomly selected households had the lowest mean airborne contaminant reading.
(iii) Normal Distribution
1. (a) 412
(b) Answers will vary but it should be very close to 412.
(c) Answers will vary but it should be very close to 412.
(d) If we took just one random sample of large enough size, we would expect it to be fairly close to the population mean. However, we collected 100 samples of the same size, worked out the sample means, and then averaged those 100 sample means. One would expect that this average would be very close to the population mean.
Mean of the Sample Means = Population Mean
2. (a) 38
(b) Answers will vary but it should be very close to 6.
(c) Answers will vary but it should be very close to 4.9.
(d) We should have learned the following.
$\text{Standard Deviation of the Sample Means} = \frac{\text{Population Standard Deviation}}{\sqrt{\text{Sample Size}}}$

Central Limit Theorem (pages 58 to 67)
1. (a) x̄ (b) σ_x̄
(c) Sx (d) µ
(e) n (f) µ_x̄
(g) σ
2. (a) 329
(b) 4.02
3. (a) normal
(b) 87, centred on the normal distribution
(c) 0.93
(d) 68%
(e) Between 85.14 and 88.86
4.
Centred about the population mean: 106.2 km/h
Spread out about its centre: 0.55 km/h (standard deviation of the sample means)
5. (a) 28.5 kg
(b) 5.0 kg
(c) 27.9 kg
(d) It represents the mean luggage weight for our sample of size 30. It is close to the expected value (population mean).
(e) 4.72 kg
(f) 0.91 kg
(g) 28.5 kg
(h) Between 27.59 and 29.41
(i) Between 26.68 and 30.32
6. The sample standard deviation describes how spread out or clustered individual data points from a single sample are relative to one another. The standard deviation of the sample means describes how spread out or clustered sample means derived from repeatedly collected samples of the same size are relative to one another.
7. (a) iii
(b) i
(c) ii
8. (a) µ_x̄ = 86, σ_x̄ = 0.51
(b) Between 84.47 and 87.53
9. (a) mean of sample means = 98.6, standard deviation of sample means = 1.71
(b) mean of sample means = 98.6, standard deviation of sample means = 1.39
(c) mean of sample means = 98.6, standard deviation of sample means = 1.21
10. As the sample size increases, the standard deviation of the sample means gets smaller, meaning that the sample means are more clustered. This seems logical. If you increase the sample size, it is more likely that a single sample is representative of the population and therefore has a sample mean that is close to the population mean. If one repeatedly collects random samples of a larger size then one would expect that the resulting sample means are collectively closer to the population mean than sample means derived from samples of a smaller size. That means that the standard deviation of the sample means will be smaller for these larger sample sizes.
11. (a) 40
(b) Sample size 30 corresponds to sampling distribution iii.
Sample size 60 corresponds to sampling distribution ii.
Sample size 90 corresponds to sampling distribution i.
Reason: As the sample size increases, the standard deviation of the sample means gets smaller, meaning that the sample means are more clustered around the population mean.
12. (a) iii
(b) i
(c) iv
(d) ii

Point Estimates and Interval Estimators (pages 68 to 78)
1. It means that the method used to create the confidence interval from $292 to $304 has a 0.95 probability of enclosing the true mean power bill for all households in Nova Scotia (i.e. population mean). There is a 0.05 probability (or 5% chance) that the method created an interval that did not enclose the population mean.
2. (a) From 574.5 to 591.5
(b) The method that produced the confidence interval from 574.5 to 591.6 has a 0.99 probability (or 99% chance) of enclosing the population mean. There is a 0.01 probability that the method produced an interval that does not enclose the population mean.
(c) We cannot tell if this confidence interval encloses the population mean because we are dealing with an unknown population.
(d) 495
3. (a) 74.8, point estimate
(b) From 72.7 to 76.9, interval estimator
(c) No, when he states that the population mean “falls within the interval”, he is implying that the population mean is random, rather than fixed.
(d) 20, no
4. (a) (From the Calculator: x̄ = 332.6 and Sx = 47.2)
From 319.7 litres to 345.5 litres
(b) (From the Calculator: x̄ = 327.6 and Sx = 37.3)
From 311.0 litres to 344.2 litres
(c) There is a greater likelihood that the method that produced the 99% confidence interval encloses the population mean because we are dealing with a higher confidence level (99% as opposed to 90%).
5. (a) From 144.2 to 151.2
(b) From 143.4 to 152.6
(c) From 142.0 to 154.0
(d) As the confidence level increases, the width of the confidence interval increases.
(e) The 90% confidence interval did not enclose the population mean but the other two confidence intervals did.
6.
(a) From 52.05 to 55.55
(b) From 53.94 to 55.86
(c) From 53.69 to 54.91
(d) Width of First Confidence Interval: 55.55 - 52.05 = 3.50
Width of Second Confidence Interval: 55.86 - 53.94 = 1.92
Width of Third Confidence Interval: 54.91 - 53.69 = 1.22
Yes: as the sample size increases, the width of the confidence interval decreases.
7. As you learned in the previous questions, only the sample size and confidence level will affect the width of the confidence interval. When everything else is constant, larger sample sizes produce narrower intervals. When everything else is constant, higher confidence levels produce wider intervals.
8. (a) True
(b) False - The confidence interval is worked out using the sample mean. The sample mean is in the middle of the confidence interval; therefore there is a 100% chance that it is enclosed within the interval.
(c) True
(d) True
(e) False - Larger sample sizes produce narrower, rather than wider, intervals.
(f) False - The problem with this statement is that they are saying that the population mean will fall between the two values. This implies that the population mean is random, rather than fixed.
(g) False - Confidence intervals are designed so that they have a strong likelihood of enclosing the population mean. Confidence intervals are quite narrow compared to the wide range of data points one would expect to obtain from a single random sample.
9. From 6.08 to 6.32
The method that produced the interval from 6.08 to 6.32 has a 0.95 probability of enclosing the true mean rainfall pH (i.e. population mean). There is a 0.05 probability that this method created an interval that does not enclose the population mean.
10. This applet allows one to generate one hundred 95% confidence intervals and one hundred 99% confidence intervals for a known population (µ = 50) and track how many enclose the population mean. When I used it, I obtained the following.
It shows that 98 of my one hundred 99% confidence intervals and 93 of my one hundred 95% confidence intervals enclosed the population mean. Every time we press SAMPLE, more confidence intervals are generated and a running record is kept in the chart at the bottom of the window.

11. (a) (From the Calculator: x̄ = 47.8 and S_x = 3.9) From 46.8 months to 48.8 months
    (b) We do not know because we are not supplied with the population mean. We are dealing with an unknown population.
    (c) The sample mean and sample standard deviation would likely change, therefore we would end up with a different confidence interval.
    (d) Width would increase
    (e) Width would decrease, assuming no significant change in the sample standard deviation.

Putting It Together (pages 79 to 87)

1. Population: all 1386 members of the sportsplex
   Sample: the 230 randomly selected members

2. (a) Numerical, Discrete
   (b) Categorical
   (c) Numerical, Continuous
   (d) Categorical
   (e) Numerical, Discrete
   (f) Numerical, Continuous

3. (a) Bimodal
   (b) Skewed (left)
   (c) Normal
   (d) Uniform

4. (a) Population: All suitcases on domestic flights
   (b) Histogram
   (c) x̄ = 14.5 kg, Median = 14.8 kg, x̄(T) = 14.9 kg
   (d) Median and Trimmed Mean

5. (a) Sample
   (b) 13.8 g/dl
   (c) 1.06 g/dl

6. (a) 204
   (b) 285
   (c) 598
   (d) 489
   (e) 584
   (f) 503
   (g) 81
   (h) 14

7. (a) Systematic Sampling (preferred)
   (b) Simple Random Sampling (preferred)
   (c) Voluntary Response (poor)
   (d) Stratified Sampling (preferred)
   (e) Cluster Sampling (preferred)

8. (a) Normal
   (b) µ_x̄ = 107.2, centered on the bell curve
   (c) 1.07

9. (a) µ = $11.52
   (b) σ = $1.47
   (c) x̄ = $11.71
   (d) It represents the average cost of lunch for the 37 randomly selected customers at this particular restaurant.
   (e) n = 37
   (f) S_x = $1.46
   (g) µ_x̄ = $11.52
   (h) σ_x̄ = $0.24
   (i) From $11.28 to $11.76

10. (a) No, the mean of the sample means should equal the population mean regardless of the sample size.
    (b) Yes, as the sample size increases, the standard deviation of the sample means decreases.

11. (a) From 283.3 to 290.9
    (b) It means that the method that produced the interval from 283.3 to 290.9 has a 0.9 probability of enclosing the population mean. There is a 0.1 probability (or 10% chance) that this method created an interval that does not enclose the population mean.
    (c) 450

12. (a) x̄ = 178.3 cm, point estimate
    (b) From 176.5 cm to 180.1 cm, interval estimator
    (c) From 175.9 cm to 180.7 cm
    (d) The method that produced the 99% confidence interval has a greater likelihood of enclosing the true mean height (i.e. population mean) because it has a higher confidence level and therefore results in a wider interval.

13. From 47.9 cm to 48.3 cm
    It means that the method that produced the interval from 47.9 cm to 48.3 cm has a 0.95 probability of enclosing the true mean head circumference (i.e. population mean). There is a 0.05 probability (or 5% chance) that this method created an interval that does not enclose the population mean. We cannot tell whether the interval encloses the population mean because we are dealing with an unknown population.

14. (a) From 32.31°C to 34.03°C
    (b) Yes, sample means are random (not fixed).
    (c) 30
    (d) decrease
    (e) increase
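
Several answers above describe a confidence interval as the product of a method that encloses the population mean with a stated probability, and note that the width grows with the confidence level and shrinks with the sample size. The Python sketch below is one possible way to see this by simulation. It is not the applet from the unit. It assumes a normal population, a z-based interval with a known population standard deviation, and illustrative values SIGMA = 10 and N = 30 that do not come from the unit; only µ = 50 is taken from the applet question. The function names z_interval and count_enclosing are hypothetical.

```python
import random
from statistics import NormalDist, mean


def z_interval(sample, sigma, confidence):
    """Return (low, high) for a z-based confidence interval, sigma assumed known."""
    n = len(sample)
    # Two-sided critical value, e.g. about 1.96 for a 0.95 confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / n ** 0.5  # z times the standard deviation of the sample means
    centre = mean(sample)
    return centre - margin, centre + margin


def count_enclosing(mu, sigma, n, confidence, trials, rng):
    """Count how many of `trials` simulated intervals enclose the population mean."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma, confidence)
        if low <= mu <= high:
            hits += 1
    return hits


rng = random.Random(2010)  # fixed seed so a rerun repeats the same samples
MU, SIGMA, N = 50, 10, 30  # MU from the applet question; SIGMA and N are assumed

for level in (0.95, 0.99):
    hits = count_enclosing(MU, SIGMA, N, level, 100, rng)
    print(f"{level:.0%} intervals enclosing the population mean: {hits} of 100")
```

As with the applet, the counts change from one set of samples to the next, so they will usually be close to, but not exactly, 95 and 99. The margin in z_interval uses the same σ/√n term as the standard deviation of the sample means in the lunch-cost question, where 1.47/√37 is approximately 0.24.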
