1.1 Definitions

1.1.1 Define the terms: statistics, data, population, and sample.

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions.

Data are measurements of one or more variables of a sample drawn from a population.

A population is the complete collection of elements (scores, people, measurements) that we want to study.

A sample is a set of individual elements (again scores, people, measurements) taken from a population.

1.2 Types of Statistics

1.2.1 Describe the difference between descriptive and inferential statistics

The study of statistics (like the chapters of our textbook) can be broken down into two main types: descriptive statistics and inferential statistics.

Descriptive statistics usually uses graphs, charts, tables, or calculations to describe data. Your income tax brochure, for example, usually includes a pie chart that shows the breakdown of your tax dollar. It clearly identifies where your hard-earned dollar goes.

Inferential statistics is used when making a statement, reaching a decision, or coming to a conclusion about a population. For example, the local chip truck might want to trial a new brand of crispy fries. A sample of loyal customers could be given the new fries, and their responses on tastiness and marketability would give the owners enough information to decide whether to launch them.

In summary, descriptive statistics are methods of organizing, summarizing, and presenting data in an informative way, whereas inferential statistics are methods used to reach a decision, an estimate, a prediction, or a generalization about a population, based on a sample.

1.2.2 Identify the four types of data and the characteristics of each type.

Data are measurements of one or more variables of a sample drawn from a population.
These data can be classified into four types: nominal, ordinal, interval, or ratio.

Types of Data:

Nominal data, as the name suggests, arise when the researcher puts a name to his or her observations. Warranty cards and surveys often ask one to check the box that best describes, say, one's profession. The list may include secretarial, professional, and skilled trade. In such a case it is clear that we are simply naming data. Confusion may creep in, however, when we use numbers as names. For instance, if we were comparing the performance of men and women on some task, we might for ease of computation code all men as 0 and all women as 1. Such numbers are still just names, and have no more computational value than the simple labels "men" and "women".

Ordinal data are essentially the same as nominal data, except that the categories can meaningfully be arranged in order: for instance tall, medium, short. A psychologist may want to rank the children in a class on the basis of how skilled they are at reading. The resulting data tell us who is the best reader, who is the second best, and so on, but give us no information on how much the children differ from each other in reading skill.

Interval data are perhaps the most commonly collected type of data in psychology. In interval data, the difference between any two adjacent numbers is equal to the difference between any other two adjacent numbers. In other words, an interval scale allows one to measure differences in size or magnitude. This may seem a confusing concept, but bear in mind that the IQ scale, for instance, is an interval scale. If Jim scores 120 on an IQ test and Tom scores 90, we can say that Jim's score is 30 points higher. We know that Jim's score is the greater of the two, and we know by how much it is greater. However, we cannot say that Jim is one-third more intelligent than Tom, because the IQ scale has no meaningful absolute zero point.
Ratio data have all of the attributes of interval data and, in addition, an absolute zero point. A good example is scores in an exam. Because of the properties of the ratio scale, if Jim scores 100 in a test and Tom 50, we may meaningfully say that Jim has scored twice as highly as Tom.

Note: Qualitative (or categorical) data are non-numerical data, which include nominal and ordinal data. Quantitative (or numerical) data are numerical data, which include interval and ratio data.

2.0 Descriptive Statistics

2.1 Discuss what is meant by a frequency distribution.

Frequency Distributions: Sometimes we need to reduce a large set of data to a much smaller set of numbers that can be more easily comprehended. Suppose, for example, that you have recorded the population sizes of 500 randomly selected cities. There is no easy way to examine these 500 numbers visually and learn anything. It would be easier to examine a condensed version of the data, and this is where the frequency distribution comes into play. A frequency distribution is a grouping of data into mutually exclusive classes showing the number of observations in each.

Let's look at an example of a frequency distribution:

Class number    Number of dolls sold    Frequency
1               5000 up to 10000        1
2               10000 up to 15000       5
3               15000 up to 20000       2
4               20000 up to 25000       2
5               25000 up to 30000       6
6               30000 up to 35000       4
7               35000 up to 40000       9
8               40000 up to 45000       8
9               45000 up to 50000       4
10              50000 up to 55000       7

This frequency distribution describes the number of dolls sold by a group of companies. For example, you can see that 1 company sold between 5000 and 10000 dolls, 5 companies sold between 10000 and 15000 dolls, etc.

How many companies in total are represented in the frequency distribution? 1+5+2+2+6+4+9+8+4+7 = 48 companies.

If a company sells exactly 10000 dolls, which class would it be counted under? It would be counted in the 10000 up to 15000 class, NOT the 5000 up to 10000 class.
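The two questions above can be checked with a short Python sketch. It assumes that every class has the same width (5000) and that "up to" means the upper limit is not included in the class. The function names class_number and class_limits are made up for this illustration and do not come from the textbook.

```python
# Doll-sales frequency distribution from the table above.
# Assumption: every class is 5000 wide and "up to" excludes the upper limit.
CLASS_WIDTH = 5000
FIRST_LOWER_LIMIT = 5000
FREQUENCIES = [1, 5, 2, 2, 6, 4, 9, 8, 4, 7]  # classes 1 through 10


def class_number(dolls_sold):
    """Return the 1-based class number that a sales figure falls into."""
    upper_bound = FIRST_LOWER_LIMIT + CLASS_WIDTH * len(FREQUENCIES)
    if not FIRST_LOWER_LIMIT <= dolls_sold < upper_bound:
        raise ValueError("value falls outside every class")
    # Integer division counts how many full class widths lie below the value.
    return (dolls_sold - FIRST_LOWER_LIMIT) // CLASS_WIDTH + 1


def class_limits(number):
    """Return (lower limit, upper limit) for a 1-based class number."""
    lower = FIRST_LOWER_LIMIT + (number - 1) * CLASS_WIDTH
    return lower, lower + CLASS_WIDTH


total_companies = sum(FREQUENCIES)
print(total_companies)       # 48
print(class_number(10000))   # 2
print(class_limits(2))       # (10000, 15000)
```

A value that sits exactly on a boundary, such as 10000, lands in the class whose lower limit it equals. This is what keeps the classes mutually exclusive.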
2.1.2 Define the terms: frequency, relative frequency, classes and class limits

frequency: how often something happens (count the number of times).

class: how the data are split up. (For example, the data above are split up into 10 classes.)

class limits: the lowest and highest values in a particular class. For example, in class number 2 above the lower class limit is 10000 and the upper class limit is 15000.

relative frequency: the proportion (a percentage given in decimal form) of the data values that lie in each class, i.e.

   relative frequency = (frequency of a given class) / (total number of values in the data)

example #1: What is the relative frequency of class #3?

example #2: What is the relative frequency of class #7?

2.1.3 Constructing a frequency distribution

When you are given data (a bunch of numbers) and asked to construct a frequency distribution, there are certain steps you need to take. The following example will illustrate.

Construct a frequency distribution for the following data on prices of vacation packages to Europe.

Data for prices of European vacation packages:

2599 3800 9720 4200 1200 9366 8255
4580 5379 5299 3349 2470 3855 1200
9945  899 5208 3800 1199 5399 2100
2100  999 7399 3557 2200 6899 9999

Steps:

1. Determine the number of classes (intervals) to use, i.e. how to split the data up. This requires judgment. It is best to have between ________________ classes. (10 is usually good.)

2. Come up with the class width (CS) using the following equation:

   class width = (highest value - lowest value) / number of classes

   (Tip! To make your frequency distribution easier to interpret, it is good to round to the nearest ten, hundred, thousand, etc.) Note: Rounding will cause the number of classes to change, but not by much.

3. Make sure the lowest class contains the lowest data value, and begin with a value that makes the frequency distribution easy to interpret. In our case the lowest class must include 899, so the first class could be 0-900. It doesn't have to start at 0 though.
We could have picked 100-1000 or 200-1100 for our first class.

The frequency distribution for the data given above:

2.1.4 Construct a relative frequency distribution

Example: Use the frequency distribution constructed above in 2.1.3, and turn it into a relative frequency distribution.

Steps:
Step 1: Do a frequency distribution.
Step 2: Calculate the relative frequency for each class and put this info into a new column called "relative frequency".

Price of ticket package ($)    Frequency
0-900                          1
900-1800                       4
1800-2700                      5
2700-3600                      2
3600-4500                      4
4500-5400                      5
5400-6300                      0
6300-7200                      1
7200-8100                      1
8100-9000                      1
9000-9900                      2
9900-10800                     2

2.1.5 Construct a cumulative frequency distribution

Example: Use the frequency distribution constructed above in 2.1.3, and turn it into a cumulative frequency distribution.

Steps:
Step 1: Do a frequency distribution.
Step 2: Add another column called "cumulative frequency", in which each entry is the frequency of the given class plus the frequencies of all the classes below it.

Price of ticket package ($)    Frequency
0-900                          1
900-1800                       4
1800-2700                      5
2700-3600                      2
3600-4500                      4
4500-5400                      5
5400-6300                      0
6300-7200                      1
7200-8100                      1
8100-9000                      1
9000-9900                      2
9900-10800                     2

Extra Practice:
Textbook readings: pg25-28, pg30, pg38-40
For additional practice try these in your textbook: pg 31: #1-8; pg 41: #15a,b and 16a,b

Recommended problems #1

1. A set of data contains 53 observations. The lowest value is 42 and the largest is 129. The data are to be organized into a frequency distribution.
   a. How many classes would you suggest?
   b. What would you suggest as the lower limit of the first class?

2. A manufacturing company produced the following number of units during the last 16 days.

   27 26 27 28 27 26 28 28
   27 31 25 30 25 26 28 26

   The information is to be organized into a frequency distribution.
   a. How many classes would you recommend?
   b. What class interval would you suggest?
   c. What lower limit would you recommend for the first class?
   d.
Organize the information into a frequency distribution and determine the relative frequency distribution.
   e. Comment on the shape of the distribution.

3. An oil company has a number of outlets in the metropolitan Seattle area. The numbers of oil changes at the Oak Street outlet in the past 20 days are:

   65 70 98 62 55 66 62 80 79 94
   59 79 51 63 90 73 72 71 56 85

   The data are to be organized into a frequency distribution.
   a. How many classes would you recommend?
   b. What class interval would you suggest?
   c. What lower limit would you recommend for the first class?
   d. Organize the number of oil changes into a frequency distribution.
   e. Comment on the shape of the frequency distribution. Also determine the relative frequency distribution.

4. The manager of a supermarket gathered the following information on the number of times a customer visits the store during a month. The responses of 51 customers were:

   5 6 3 11 9 3 7 5 3 2 3 1 3 12 12 1 1
   4 4 4 14 5 7 4 1 6 6 5 2 8 5 6 4 4
   15 4 4 7 1 2 4 6 1 6 5 5 10 6 6 9 8

   a. Starting with 0 as the lower limit of the first class and using a class interval of 3, organize the data into a frequency distribution.
   b. Describe the distribution. Where do the data tend to cluster?
   c. Convert the distribution to a relative frequency distribution.

5. The food services division of an amusement park is studying the amount that families who visit the amusement park spend per day on food and drink. A sample of 40 families who visited the park yesterday revealed they spent the following amounts.

   77 50 63 63 18 34 62 58 63 44
   62 61 84 41 65 71 38 58 61 54
   58 52 50 53 60 59 51 60 54 62
   45 56 43 66 36 52 83 26 53 71

   a. Organize the data into a frequency distribution, using seven classes and 15 as the lower limit of the first class. What class interval did you select?
   b. Where do the data tend to cluster?
   c. Describe the distribution.
   d. Determine the relative frequency distribution.

6.
The frequency distribution representing the number of frequent flier miles accumulated by employees at a consulting company is represented below:

   Frequent flier miles (in thousands)    Frequency
   0 up to 3                              5
   3 up to 6                              12
   6 up to 9                              23
   9 up to 12                             8
   12 up to 15                            2
   total                                  50

   a. How many employees accumulated less than 3000 miles?
   b. Convert the frequency distribution to a cumulative frequency distribution.

7. The frequency distribution of order lead time at a firm is:

   Lead time (days)    Frequency
   0 up to 5           6
   5 up to 10          7
   10 up to 15         12
   15 up to 20         8
   20 up to 25         7
   total               40

   a. How many orders were filled in less than 10 days? In less than 15 days?
   b. Convert the frequency distribution to a cumulative frequency distribution.

MA1670 Comprehensive Assignment: 20%, out of 337 marks

Show all workings. All questions must be done by hand unless otherwise specified. Full marks will not be given if workings are insufficient.

1. a. Construct and fully label a relative frequency distribution to represent the data below. The data represents the price of books on one shelf in a bookstore. (4 marks)

   16 22 10 25 19 5 60 28 30 40 85 80 40 45 30 10 15 22 28 37 10

   b. Determine the mean, median, mode, and midrange (omit midrange) of the data. (You may do this by hand or by using technology) (4 marks)
   c. Construct a histogram to represent the data. (4 marks)

2. The following ogive represents the energy efficiencies for a group of buildings owned by a company. Answer each of the following questions.
   a. Approximately what percentage of the company's buildings has an energy efficiency of 220 kWh/sq m/yr? (1 mark)
   b. 70% of buildings have less than what energy efficiency? (1 mark)

3. The following pie chart represents the favourite sports of a class of 300 students.
   a. How many students chose apple? (don't give the percent) (1 mark)
   b. How many students in total chose pecan and coconut cream? (1 mark)

4. Use the data below to find the following:

   6, 29, 9, 45, 23, 30, 36, 10, 26, 30

   a.
10th percentile (4 marks)
   b. 29th percentile (4 marks)
   c. the third quartile (4 marks)

5. Through calculating standard deviation, determine which set of data below (set A or set B) has greater variation (i.e. the data has greater dispersion). (9 marks)

   Set A: 5, 6, 9, 9, 11, 13
   Set B: 1, 5, 9, 11, 12, 15

(omit)6. Use Chebychev's Theorem to determine what proportion of data will generally fall within +2 or -2 standard deviations of the mean of the data. (note: this does not refer to the data in #5) (2 marks)

7. Use the frequency distribution below to answer the following questions.

   # of hours worked    frequency
   0-5                  4
   5-10                 2
   10-15                0
   15-20                5

   a. What is the mean? (5 marks)
   b. What is the standard deviation? (5 marks)

(omit)8. The price for a particular type of jeans has changed over the years, as seen in the data below.

   Year    Price ($)
   1980    25
   1985    32
   1990    40
   1995    52

   a. Using the 1980 price as the base value, what is the index number for the 1990 cost? (1 mark)
   b. Using the 1990 price as the base value, what is the index number for the 1995 cost? (1 mark)

9. A bag contains 58 balls.

   red    white    blue
   37     12       9

   You pick one ball.
   a. What is the probability that the ball is red? (1 mark)
   b. What is the probability that the ball is blue or red? (1 mark)
   c. Are the events "pick a blue ball" and "pick a red ball" mutually exclusive? (assume you only pick one ball) (1 mark)

   You pick two balls.
   d. What is the probability that the first ball is red and the second ball is blue if you replace your first pick? (2 marks)
   e. What is the probability that the first ball is red and the second ball is blue if you do not replace your first pick? (2 marks)
   f. What is the probability that you pick two white balls in a row if you replace your first pick? (2 marks)
   g. What is the probability that you pick two white balls in a row if you do not replace your first pick? (2 marks)

   You pick five balls.
   h.
What is the probability that you pick five red balls in a row if you do not replace any of your picks? (2 marks)

10. A bag contains the following mixture of balls.

               red    white    blue
   glittery    20     2        6
   dull        17     10       3

   If you pick one ball:
   a. What is the probability that it is red and glittery? (1 mark)
   b. What is the probability that it is red or glittery? (2 marks)
   c. What is the probability that it is dull? (1 mark)
   d. What is the probability that it is blue and dull? (1 mark)
   e. What is the probability that it is dull or white? (2 marks)
   f. What is the probability that it is white or blue? (2 marks)

   If you pick two balls:
   g. What is the probability that you pick a red, dull ball and then a white, glittery ball, if you replace your first pick? (2 marks)
   h. What is the probability that you pick a red, dull ball and then a white, glittery ball, if you do not replace your first pick? (2 marks)

   If you pick three balls:
   i. What is the probability that you pick 3 glittery, blue balls in a row if you replace your picks each time? (2 marks)
   j. Which of the following is more likely: (2 marks)
      picking three red, glittery balls in a row if you do not replace any of your picks
      or
      picking three dull, blue balls in a row if you replace your picks each time

11. A hairdresser knows how to do eight different haircuts, has five different dyes, and four different types of highlights. How many different hairdos are possible for someone who wishes to get a haircut, dye their hair, and get highlights? (2 marks)

12. In how many ways can horses in a 10-horse race finish first, second, and third? (3 marks)

13. How many different simple random samples of size 4 can be obtained from a population whose size is 20? (3 marks)

14. What is the probability of flipping a coin and getting five heads in a row? (2 marks)

15. In a class, 45% are women and 55% are men. A test is given and it is determined that 10% of the women failed and 15% of the men failed.
   a.
What is the probability that if a student failed, then they are male? (4 marks)
   b. What is the probability that if a student passed, then they are female? (4 marks)

16. State whether each of the following variables is discrete or continuous.
   a. number of newspapers sold (1 mark)
   b. number of days missed (1 mark)
   c. height of students (1 mark)
   d. length of table (1 mark)

17. The table below represents a discrete probability distribution.

   x       0       1    2       3       4
   P(x)    0.10    0    0.40    0.35    0.15

   a. Find the probability that x is:
      i. exactly 3 (1 mark)
      ii. 3 or less (1 mark)
      iii. more than 2 (1 mark)
   b. Compute the mean of the distribution. (3 marks)
   c. Compute the variance of the distribution. (3 marks)
   d. Compute the standard deviation of the distribution. (1 mark)

(omit)18. It is determined that 20% of the population in a town have type A-positive blood. A simple random sample of size 7 is taken and the number of people X with blood type A-positive is recorded. Determine the probabilities of the following events using the binomial formula.
   a. none of them have type A-positive blood (4 marks)
   b. exactly two of them have type A-positive blood (4 marks)
   c. exactly five of them have type A-positive blood (4 marks)

19. A telemarketer makes 9 phone calls per hour and is able to make a sale on 10% of these contacts. Determine (using your choice of method)
   (omit)a. the probability of making exactly four sales (1 mark)
   (omit)b. the probability of making exactly six sales (1 mark)
   (omit)c. the probability of making no sales (1 mark)
   d. the mean number of sales (3 marks)
   e. the variance of the sales (3 marks)
   f. the standard deviation of the sales (1 mark)
   (omit)g. the probability of making at least 6 sales (hint: this equals the probability of making 6 sales plus the probability of making 7 sales plus... etc.) (2 marks)

20. The annual commissions earned by sales representatives at a company follow a normal distribution.
The mean yearly amount earned is $40 000 and the standard deviation is $5000. (4 marks each)
   a. What percent of the sales representatives earn between $40 000 and $42 000?
   b. What percent of the sales representatives earn more than $42 000?
   c. What percent of the sales representatives earn less than $42 000?
   d. What percent of the sales representatives earn between $32 000 and $42 000?
   e. What percent of the sales representatives earn between $32 000 and $35 000?

21. The weights of cans of pears follow the normal distribution with a mean of 1000 grams and a standard deviation of 50 grams. Calculate the percentage of cans that weigh: (4 marks each)
   a. less than 860 grams
   b. between 1055 and 1100 grams
   c. between 860 and 1055 grams

22. Of a telemarketer's calls, 0.20 are successful. Suppose that 75 of the telemarketer's calls are randomly selected. Use the normal approximation to the binomial distribution to determine the probability that: (4 marks each)
   a. fewer than 20 calls are successful
   b. fewer than 13 calls are successful
   c. more than 16 calls are successful
   d. more than 14 calls are successful

23. State what method of probability sampling each of the following is.
   a. A sample of a whole school needs to be taken, so a sample of students from each class is taken to represent the school. (1 mark)
   b. A sample of a whole school needs to be taken, so one class is chosen for the sample to represent the entire school. (1 mark)
   c. A sample of a whole school needs to be taken, so all students' names are put in a hat and names are drawn from it to act as the sample. (1 mark)
   d. A sample of a whole school needs to be taken, so every 15th student is chosen for the sample. (1 mark)

24. The standard deviation of a population is 1.6 and the mean is 24.6.
   a. What is the standard deviation of the sampling distribution of the sample mean if a sample size of 8 is chosen? (2 marks)
   b. What is the mean of the sampling distribution of the sample mean? (1 mark)
   c.
What is the standard error of the mean? (1 mark)
   d. How does the spread of the sampling distribution of the sample mean compare with the spread of the population? (1 mark)

25. The mean of a population is 140 and the standard deviation is 5.2.
   a. What is the value above which 8% of the data lies? (4 marks)
   b. What is the value above which 70% of the data lies? (4 marks)
   c. What is the value below which 10% of the data lies? (4 marks)

26. A company wishes to determine whether the mean rating score for its employees is greater than 85. The rating score was determined for a random sample of 122 managers with the following results: the sample mean was 99.6 and the sample standard deviation was 12.6. Can the company conclude that the mean rating score for employees is greater than 85? Use the 0.025 significance level. (8 marks)

27. In 1995, 74% of the population felt that men were more aggressive than women. In a poll, a simple random sample of 1026 people 18 years old or older resulted in 698 respondents stating that men were more aggressive than women. Is there significant evidence to indicate that the proportion of people who believe that men are more aggressive than women has decreased from the level reported in 1995 at the α=0.05 level of significance? (8 marks)

28. A department wishes to determine whether the mean number of suicide bombings for all Al Qaeda attacks against the US differs from 2.5. A sample of 21 recent incidents involving suicide terrorist attacks was analyzed and the sample mean was 1.86 with a sample standard deviation of 1.20. Use the 0.10 significance level to determine whether the mean number of suicide bombings differs from 2.5. (8 marks)

29.
In order to compare the means of two populations, independent random samples of 64 observations were selected from each population, with the following results:

   Population 1    Population 2

   At the 0.05 significance level, can we conclude that the mean of population one is greater than the mean of population two? (8 marks)

30. In order to compare the means of two populations, independent random samples of 20 observations were selected from each population, with the following results:

   Population 1    Population 2

   At the 0.05 significance level, can we conclude that the mean of population one is greater than the mean of population two? (10 marks)

31. A company wishes to determine whether the proportion of super-experienced auction bidders who fall victim to the "winner's curse" (i.e. the phenomenon of the winning bid price being above the expected value of the item being auctioned) is different from the proportion of less-experienced bidders. In the super-experienced group, 29 of 189 winning bids were above the item's expected value, while in the less-experienced group, 32 of 149 winning bids were above the item's expected value. At the 0.10 significance level, can we conclude that there is a difference in the proportions of each population? (8 marks)

(omit)32. A real estate agent wants to compare the variation in the selling price of homes on the ocean with that of homes one block from the ocean. A sample of 21 homes on the ocean revealed the standard deviation of the selling prices was $15600. A sample of 18 homes that were sold one block from the ocean revealed the standard deviation of the selling prices to be $11330. At the 0.01 significance level, can we conclude that there is more variation in the selling prices of the homes sold on the ocean? (8 marks)

33. The data in the following table resulted from an experiment that used a completely randomized design.

   Treatment 1    3.8    1.2    4.1    5.5    2.3
   Treatment 2    5.4    2.0    4.8    3.8
   Treatment 3    1.3    0.7    2.2

   Can we conclude that at least two of the means of the treatment groups differ? Use the 0.01 level of significance. (16 marks)

34. A six-sided die is rolled 30 times and the numbers 1 through 6 appear as shown in the following frequency distribution. At the 0.10 significance level, can we conclude that the categories are not all equal? (12 marks)

   outcome      1    2    3    4    5    6
   frequency    3    6    2    3    9    7

35. Traffic experts wanted to determine whether there is a relationship between cell phone use and having a car accident. The data below was gathered. Using the 0.05 significance level, can we conclude that there is a relationship between the two variables? (12 marks)

                                         Cell phone in use    Cell phone not in use
   Had accident in last year             25                   50
   Did not have accident in last year    300                  400

36. Use the data below to answer the following questions.

   x    1    6    2
   y    4    8    3

   a. Determine the coefficient of correlation. (6 marks)
   b. Determine the regression equation. (8 marks)
   c. Determine the value of y' when x=5. (1 mark)
   d. Determine the standard error of estimate. (5 marks)
   e. Determine the 95% confidence interval for the mean predicted when x=5. (5 marks)

37. A commuter airline selected a random sample of 25 flights and found that the correlation between the number of passengers and the total weight, in pounds, of luggage stored in the luggage compartment is 0.94. Using the 0.05 significance level, can we conclude that there is a positive correlation between the two variables? (8 marks)

2.2 Graphical Techniques

2.2.1 Describe the shape of a frequency distribution

The information contained in a frequency distribution (which we learned how to construct in the last section) can be displayed graphically in several ways to make it easier to read. Some of these ways are histograms, frequency polygons, bar charts, stem and leaf diagrams, pie charts, and ogives.
The next few sections will describe how to analyze and construct these graphical displays.

2.2.2 Construct and analyze a histogram (read pg 32-34)

One of the most common ways to portray a frequency distribution is a histogram. After you complete a frequency distribution, your next step will be to construct a "picture" of these data values using a histogram. A histogram is a graphical representation of a frequency distribution. It describes the shape of the data. You can use it to quickly answer such questions as: Are the data symmetric? Where do most of the data values lie?

Let's look at the histogram for the following frequency distribution:

Class number    Class                Frequency
1               250 and under 350    4
2               350 and under 450    8
3               450 and under 550    20
4               550 and under 650    8
5               650 and under 750    7
6               750 and under 850    3

Histogram:

Notes on histograms: In a histogram, the classes are marked on the horizontal axis and the class frequencies are marked on the vertical axis. The class frequencies are represented by the heights of the bars, and the bars touch each other.

Analyzing a histogram: Below is a histogram developed from the frequency distribution of the prices of vehicles sold at a dealership. The histogram provides an easily interpreted visual representation. We can see at a glance that a total of 23 vehicles sold in the price range of $15,000 to $18,000. We can also readily see that 58 vehicles (72.5%) sold in the range of $15,000 to $24,000.

Try answering these additional questions.
a. How many vehicles were sold in total? ____________
b. How many vehicles were sold between $21000 and $33000? ____________
c. What is the relative frequency of vehicles sold in the $21000 to $24000 range? ___________
d. Express part c as a percentage. ____________
e. What percentage of vehicles was sold between $12000 and $18000? ____________

Example #1: Construct a histogram for the frequency distribution given below.

Class number    Hours spent studying    Frequency
1               0 up to 3               2
2               3 up to 6               4
3               6 up to 9               7
4               9 up to 12              0
5               12 up to 15             5

Example #2: Answer the following questions. The histogram below represents employee salary data for a privately-owned small company.
a. How many people in the company earned between $15,240 and $17,430?
b. How many people in the company earned between $6480 and $19,620?
c. In what salary range were 23 employees earning?
d. How many people in total worked for the company?
e. What is the relative frequency of people who earned between $6480 and $10,860?
f. What percentage of people earned between $15,240 and $26,190?

2.2.3 Construct and analyze a frequency polygon (pg 34-36)

Frequency Polygons: Although a histogram does demonstrate the shape of the data, perhaps the shape can be more clearly illustrated by using a frequency polygon. Here, you merely connect the centers of the tops of the histogram bars (located at the class midpoints) with a series of straight lines. The resulting multi-sided figure is a frequency polygon.

Let's look at the frequency polygon for the histogram in section 2.2.2.

Example #1: Construct a frequency polygon for the frequency distribution below.

Class number    Number of cars sold    Frequency
1               0 up to 20             1
2               20 up to 40            4
3               40 up to 60            8
4               60 up to 80            4
5               80 up to 100           0

For extra practice the following exercises are recommended: pg41 #13-16

2.2.7 Construct and analyze an ogive. (not in text)

Step 1: To construct an ogive, you must first construct a relative frequency distribution. (section 2.1.4)
Step 2: Then you must take the cumulative frequencies of the relative frequencies.
Step 3: The data contained in the cumulative relative frequency distribution is then graphed.

The following example will illustrate.

Example #1: Construct an ogive based on the data contained in the frequency distribution below.

Number of days of work    Frequency
250 up to 350             4
350 up to 450             8
450 up to 550             20
550 up to 650             8
650 up to 750             7
750 up to 850             3

Ogive:

Example #2: Answer the following questions on the ogive below. The ogive shows the annual transaction counts for a company.
a. What is the class interval width? ___________
b. What percentage of transactions are less than 400? ____________
c. About 75% of transactions are less than what value? ____________
d. About 85% of transactions are less than what value? ____________
e. What percent of transactions are less than 300? ____________
f. 100% of the transactions are less than what value? ____________

2.2.4 Construct and analyze a bar chart. (pg 43)

A bar chart can be used to depict any of the types of data (nominal, ordinal, interval, or ratio). A bar chart is similar to a histogram in that the height of a bar represents the frequency of the class. The bars don't touch in a bar chart though.

Example #1: Construct a bar chart for the following data.

Type of electrical equipment    Number sold
TV                              40
DVD player                      37
CD player                       22
Radio                           10

Example #2: Analyze the following bar chart.
a. What were Vendor E's sales? ______________
b. What were the total sales of all of the vendors? ______________

(recommended problems: pg47 #22)

2.2.5 Construct and analyze a stem and leaf diagram (not in text)

A stem-and-leaf display is a statistical technique for displaying a set of data. Each number is divided into two parts: the leading digit becomes the stem and the trailing digit becomes the leaf.

Example #1: Lisa receives the following grades on her accounting assignments:

8, 74, 54, 23, 50, 84, 66, 80, 44, 88, 67, 90, 72, 105, 84

Represent the data in a stem-and-leaf display.

stem    leaf

Example #2: Answer the questions about the stem-and-leaf display below representing the per capita GNP for each country in western Africa.
1. For how many countries was data collected? ______________
2.
Did any of the countries have a GNP of $700? ______________
3. How many countries have a GNP less than $400? ______________
4. Did any countries have the same GNP? ______________

2.2.6 Construct and analyze a pie chart. (pg 44 in text)
Pie Chart - A pie chart is especially useful in displaying a relative frequency distribution. A circle is divided in proportion to the relative frequencies, and a portion of the circle is allocated to each group. Its purpose is to show the relative comparison between parts of a total.

Example: Answer the following questions on the pie chart below representing the favourite movie genres in a school of 200 students.
a. Horror movies account for what percentage of favourite movies in the school? _________
b. How many students in the school cite horror as their favourite genre? ___________
c. How many students combined cite foreign or romance as their favourite genres? __________
d. If you tried to draw this pie chart, how many degrees (for the central angle of each sector) would each genre have? Remember: a circle is 360º!

Genre              Central angle measure for sector
Comedy
Action
Romance
Drama
Horror
Foreign
Science fiction

e. All of your angle measures in part d should add up to what value?
recommended problems: pg37 #9 (omit f), 11a,c,e, 12a,d,e; pg46 #17, 20, 22

Worksheet #2
1. Answer the questions on the following ogive representing the amount of money raised by a class of students.
a. What percentage of students raised less than $500?
b. 20% of students raised how much money?
c. What percentage of students raised less than $30?
d. 90% of students raised how much money?
e. How much money was raised in total? (i.e. by 100% of the students)
2. Construct a stem and leaf diagram for the following data.
24 27 28 72 66 52 50 34 39 30
3. Construct a pie chart based on the following data. (A class of students with pets was polled.)

Pet      Frequency
Dog      25
Cat      20
Fish     10
Other    30

Recommended Problems #2
1.
A store has several retail stores in the coastal areas of North and South Carolina. Many of the customers ask the owner to ship their purchases. The following chart shows the number of packages shipped per day for the last 100 days.
a. What is the chart called?
b. What is the total number of frequencies?
c. What is the class interval?
d. What is the class frequency for the 10 up to 15 class?
e. What is the relative frequency of the 10 up to 15 class?
g. On how many days were there 25 or more packages shipped?

2. The following frequency distribution reports the number of frequent flier miles, reported in thousands, for employees of a consulting firm during the first quarter of 2004.

Frequent flier miles (in thousands)    Number of employees
0 up to 3                              5
3 up to 6                              12
6 up to 9                              23
9 up to 12                             8
12 up to 15                            2
Total                                  50

a. How many employees were studied?
b. Construct a histogram.
c. Construct a frequency polygon.

3. An internet retailer is studying the lead time (elapsed time between when an order is placed and when it is filled) for a sample of recent orders. The lead times are reported in days.

Lead time (days)    Frequency
0 up to 5           6
5 up to 10          7
10 up to 15         12
15 up to 20         8
20 up to 25         7
Total               40

a. How many orders were studied?
b. Draw a histogram.
c. Draw a frequency polygon.

4. A small business consultant is investigating the performance of several companies. The sales in 2004 (in thousands of dollars) for the selected companies were:

Corporation                Fourth-quarter sales (thousands of dollars)
Hoden building products    1645.2
J and R printing           4757.0
Long Bay concrete          8913.0
Mancell electric           627.1
Maxwell heating            24612.0
Mizelle roofing            191.9

The consultant wants to include a chart in his report comparing the sales of the six companies. Use a bar chart to compare the fourth-quarter sales of these corporations.

5.
A report prepared for the mayor of a city indicated that 56 percent of the city’s tax revenue went to education, 23 percent to the general fund, 10 percent to the counties, 9 percent to senior programs, and the remainder to other social programs. Sketch a pie chart to show the breakdown of the budget. If 3.5 million dollars is generated in tax revenue, how much went to education?

6. Shown below are the military and civilian personnel expenditures for the eight largest military locations in the US. Develop a bar chart to represent the data.

Location       Amount spent (millions)
St. Louis      6087
San Diego      4747
Pico Rivera    3272
Arlington      3284
Norfolk        3228
Marietta       2828
Fort Worth     2492
Washington     2347

2.4 Numerical Techniques
2.4.1 Use the summation notation representation to sum numbers
The Greek letter \(\Sigma\) (capital sigma) is used to denote the summation of a set of numbers. If we have a quantitative data set consisting of \(x_1, x_2, x_3, \ldots, x_n\), this means that \(x_1\) is the first measurement in the data set, \(x_2\) is the second, and \(x_n\) is the nth and last measurement in the group. If we have five measurements in a set and they are \(x_1 = 5\), \(x_2 = 3\), \(x_3 = 8\), \(x_4 = 5\), and \(x_5 = 4\), then in order to add up this set, we use the symbol sigma (\(\Sigma\)) as
\(\sum x = x_1 + x_2 + x_3 + x_4 + x_5 = 5 + 3 + 8 + 5 + 4 = 25\)

2.4.2 Measures of Central Tendency
2.4.2.1/2.4.2.2 Discuss the roles of mean, median, midrange and mode as ways of measuring central tendency in data and calculate them (readings: pg 59: “The Population Mean”, pg 64-65: “The Median”, and pg 65-66: “The Mode”)
Measures of Central Tendency: The purpose of a measure of central tendency is to determine the "center" of your data values or possibly the "most typical" data value. Some measures of central tendency are the mean, median, midrange, and mode.
The mean: The mean is the most popular measure of central tendency. It is simply the average of the data. The mean is equal to the sum of the data values divided by the number of data values.
Mathematically the mean is given as follows:
\(\text{mean} = \bar{x} = \frac{\sum x}{N}\)
where N (or n) is the number of values in the data set.
Take for example: The number of accidents reported over a particular 5 month period was: 6, 9, 7, 23, 5
So the mean of this sample is:
\(\bar{x} = \frac{\sum x}{N} = \frac{6 + 9 + 7 + 23 + 5}{5} = \frac{50}{5} = 10\)
Special note: When we talk of the mean, we can have a population mean or a sample mean:
the symbol for the population mean is the Greek letter \(\mu\)
the symbol for the sample mean is \(\bar{x}\)

The median: The median of a set of data is the value in the center of the data values when they are arranged from smallest to largest. Consequently, it is in the center of the ordered array. Using the accident data set, the median (Md) is found by first constructing an ordered array: 5, 6, 7, 9, 23, so the median here is 7. If there is an even number of data values, like 3, 8, 12, 14, then Md is the average of the two center values; thus the median for these numbers is (8 + 12)/2 = 10.
Note: In our accident data set, one of the five values (23) is much larger than the remaining values. It is what we call an outlier (an out of whack data value). Notice that the median (Median = 7) was much less affected by this value than was the mean (\(\bar{x} = 10\)). When dealing with data that are likely to contain outliers (for example, personal incomes or prices of residential housing), the median usually is preferred to the mean as a measure of central tendency, since the median provides a more "typical" or "representative" value for these situations.

(omit midrange) The Midrange: Although less popular than the mean and median, the midrange (Mr) provides an easy to grasp measure of central tendency. Notice that it is also severely affected by the presence of an outlier in the data. The midrange is:
\(\text{Midrange} = \frac{(\text{smallest value}) + (\text{largest value})}{2}\)
For our accident data:
\(\text{Midrange} = \frac{5 + 23}{2} = \frac{28}{2} = 14.0\)

The mode: The mode of a data set is the value that occurs more than once and the most often.
The mode is not always a measure of central tendency; this value need not occur in the center of your data. One situation in which the mode is the value of interest is the manufacturing of clothing. The most common hat size is what you would like to know, not the average hat size. For our accident data there is no mode since all values occur only once, but let’s consider this data set: 4, 8, 7, 6, 9, 8, 10, 5, 8
Here 8 occurs three times, which is most often, so Mode = 8.
Note: There can be more than one mode in a set of data. For example: 1, 1, 3, 5, 7, 7, 9
There are two modes for the data above (1 and 7).

Example: A sample of ten was taken to determine the typical completion time (in months) for the construction of a particular model of Brockwood Homes:
4.1, 3.2, 2.8, 2.6, 3.7, 3.2, 9.4, 2.5, 3.5, 3.8
Find the
a. mean
b. median
c. midrange
d. mode

2.4.3 Measures of variation (pg 74: “Measures of Dispersion: The Range”, pg 77-78: “Variance and Standard Deviation”)
Measures of Variation:
Variability: Variability provides a quantitative measure of the degree to which scores in a distribution are spread out or clustered together. The purpose of measuring variability is to determine how spread out a distribution of scores is. Are the scores all clustered together, or are they scattered over a wide range of values?

The range: The range is the numerical difference between the largest value and the smallest value in a data set. If the number of accidents reported over a 5 month period was 6, 9, 7, 23, and 5, the range for this data set is:
range = (largest value) - (smallest value) = 23 - 5 = 18
The range is a rather crude measure of variation, but it is an easy number to calculate and contains valuable information for many situations. Stock reports generally give prices in terms of ranges, citing the high and low prices of the day.
Note: The value of the range is strongly influenced by an outlier in the data set.
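The measures discussed so far (mean, median, midrange, mode, and range) can be checked with a short Python sketch. This is only an illustration that reuses the accident data and the two mode examples from the notes; the helper names `midrange`, `value_range`, and `modes` are made up for this sketch and do not come from the textbook.

```python
from statistics import mean, median, multimode

# Accident counts from the running example in these notes.
accidents = [6, 9, 7, 23, 5]


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def modes(values):
    """Values that occur more than once and most often (empty list if none)."""
    most_common = multimode(values)
    return [v for v in most_common if values.count(v) > 1]


print(mean(accidents))         # 10
print(median(accidents))       # 7
print(midrange(accidents))     # 14.0
print(value_range(accidents))  # 18
print(modes(accidents))        # [] (every value occurs only once)
print(modes([4, 8, 7, 6, 9, 8, 10, 5, 8]))  # [8]
print(modes([1, 1, 3, 5, 7, 7, 9]))         # [1, 7]
```

The `modes` helper filters out values that occur only once so that it matches the definition used in these notes, where a data set with no repeated value has no mode.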
Standard deviation: The standard deviation is the most commonly used and the most important measure of variability. Standard deviation uses the mean of the distribution as a reference point and measures variability by considering the distance between each score and the mean. It determines whether the scores are generally near or far from the mean. That is, are the scores clustered together or scattered? In simple terms, the standard deviation approximates the average distance from the mean.
The standard deviation is the square root of the variance. If the data points are all similar, then the standard deviation will be low (closer to zero). If the data points are highly variable, then the standard deviation is high (further from zero).

Calculating the variance and standard deviation given a set of numbers representing a population. (Note: the formulas for a sample are different from those for a population.)
Step One: Calculate the mean of the population (represented as \(\mu\)).
Step Two: Calculate \((X - \mu)\) for each number (X represents a number from the population).
Step Three: Square each difference from Step 2, i.e. \((X - \mu)^2\).
Step Four: Get the mean of the squares from Step 3, i.e. \(\sigma^2 = \frac{\sum (X - \mu)^2}{N}\), where N is the number of numbers in the data. This is your variance.
Step Five: If the question asked for the standard deviation of the population, then you would just take the square root of your variance, i.e. \(\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}\).
It is helpful to do the calculations in a chart such as the one used below to find variance and standard deviation.

Example #1: A store sells the following numbers of TVs over a week. 3, 2, 5, 0, 7
a. Find the variance in the data.
b. Find the standard deviation.

Note: if you were finding the standard deviation of a sample, the formula would be \(s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}\).

2.4.3.2 Discuss the interpretation of standard deviation.
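To make the interpretation concrete, the five population steps from section 2.4.3 can be sketched in Python. This is a minimal sketch, not the textbook's method: the function names are invented for illustration, and the data are the accident counts 6, 9, 7, 23, 5 used earlier in these notes.

```python
from math import sqrt


def population_variance(values):
    """Steps One to Four: mean of the squared differences from the mean."""
    n = len(values)
    mu = sum(values) / n                         # Step One: population mean
    squares = [(x - mu) ** 2 for x in values]    # Steps Two and Three
    return sum(squares) / n                      # Step Four: variance


def population_std_dev(values):
    """Step Five: square root of the population variance."""
    return sqrt(population_variance(values))


accidents = [6, 9, 7, 23, 5]
print(population_variance(accidents))           # 44.0
print(round(population_std_dev(accidents), 2))  # 6.63
```

The single outlier (23) contributes 169 of the 220 total squared distance, which is why the standard deviation is so much larger than the typical gap between the other four values.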
As mentioned above, the standard deviation is a good measure of dispersion, or how spread out the data is. The bigger the standard deviation, the more variation there is in the data; the lower the standard deviation, the lower the variation in the data. The standard deviation tells us, on average, how far a given number is from the mean.

OMIT 2.4.3.3!
2.4.3.3 Discuss Chebyshev’s Theorem (pg 82: “Chebyshev’s Theorem”)
For any set of observations, the proportion of the values that lie within k standard deviations of the mean is at least \(1 - \frac{1}{k^2}\), where k is any constant greater than 1.
Examples:
1. At least what proportion of data will lie within 2 standard deviations of the mean?
2. At least what proportion of data will lie within 1.4 standard deviations of the mean?
recommended extra problems: page 84 #49, #50

2.4.4 Measures of Position
2.4.4.2/2.4.4.2 Define and calculate percentile and quartile. (pg 97-99: “Quartiles, Deciles, and Percentiles”; deciles can be omitted though!)
Percentiles divide a set of numbers into 100 equal parts. For example, if your GPA was in the 66th percentile, then 66 percent of the students had a lower GPA. A formula can be applied to determine the location in a list of numbers for a given percentile.
\(L_p = (n + 1)\frac{P}{100}\)
\(L_p\) represents the location of the desired percentile
n is the number of observations (numbers)
P is the desired percentile.

Examples:
1. Use the following data to answer the questions. 1, 3, 4, 5, 6, 7, 7
a. Find the 25th percentile of the above data.
b. Find the 50th percentile of the above data.
c. Find the 75th percentile of the above data.
Note: Percentile problems do not always work out perfectly! This is how you do them if your \(L_p\) does not work out to a whole number.
#2. Use the data below to answer the following questions. 20, 4, 7, 22, 11, 14, 1, 8
a. Find the 36th percentile.
b. Find the 62nd percentile.
c. Find the 83rd percentile.
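The location formula \(L_p = (n + 1)\frac{P}{100}\) can be sketched in Python as follows. Two things here are assumptions rather than something stated in these notes: the data set is hypothetical, and when \(L_p\) is not a whole number the sketch interpolates between the two neighbouring ordered values, which is one common textbook convention. The function names are illustrative only.

```python
def percentile_location(n, p):
    """Location L_p = (n + 1) * P / 100 in the ordered list (1-based)."""
    return (n + 1) * p / 100


def percentile(values, p):
    """Value at the p-th percentile, interpolating when L_p is fractional.

    Assumption: linear interpolation between neighbouring ordered values;
    locations outside 1..n are clamped to the ends of the list.
    """
    ordered = sorted(values)
    n = len(ordered)
    loc = min(max(percentile_location(n, p), 1), n)
    whole = int(loc)        # position of the value at or just below L_p
    frac = loc - whole      # how far to move toward the next value
    if frac == 0:
        return ordered[whole - 1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + frac * (above - below)


# Hypothetical data set, chosen only for illustration.
data = [2, 4, 6, 8, 10, 12, 14, 16, 18]
print(percentile_location(len(data), 25))  # 2.5
print(percentile(data, 25))                # 5.0 (halfway between 4 and 6)
print(percentile(data, 50))                # 10
print(percentile(data, 80))                # 16
```

Always sort the data first: the location \(L_p\) refers to a position in the ordered list, not in the list as it was originally written down.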
Finding quartiles
Quartiles divide a set of observations (numbers) into four equal parts (quarters).
The first quartile, usually labeled \(Q_1\), is the value below which 25 percent of the observations occur. This would be the 25th percentile as well.
The second quartile, \(Q_2\), is the value below which 50 percent of the observations occur. This is also the median. This would be the 50th percentile as well.
The third quartile, usually labeled \(Q_3\), is the value below which 75 percent of the observations occur. This would be the 75th percentile as well.
So if you are asked to calculate the first, second, or third quartile, just use the appropriate percentile value P in the formula \(L_p = (n + 1)\frac{P}{100}\).

Examples
3. Use the data set below to answer the following questions. 2, 5, 50, 11, 46, 16, 22, 36, 15, 8
a. Find the first quartile.
b. Find the third quartile.
c. Find the second quartile.

Math Worksheet #3
1. The data below represents the number of times an item is returned in a store each day.
4 4 5 6 8 10 10 15 15 17 18 20 22
a. Find the mean.
b. Find the median.
c. Find the mode.
d. Find the midrange.
2. The standard deviation for class A is 2.4 and the standard deviation for class B is 13.5. Which class has a greater variety of test scores?
3. If a set of data has a standard deviation of 10, what, on average, is the distance of a given number from the mean?
4. What is the standard deviation if the variance is 5.8?
5. What is the variance if the standard deviation is 15.6?
6. Use the data below to answer the following questions.
2 17 18 9 9 18
a. Calculate the mean.
b. Calculate the variance.
c. Calculate the standard deviation.
7. On average, the numbers in the data set in #6 fall within ___________ units of the mean.
8. A set of data has a standard deviation of 9.2. Does this set of data have more or less variation than the data in #6? Why?
9. Why is the standard deviation a better measure of dispersion than the range?
10.
List four measures of central tendency and three measures of dispersion.
11. Use the data below to answer the following:
2 8 3 10 6 17 29 40 36
a. Find the 70th percentile.
b. Find the 27th percentile.
c. Find the third quartile.
d. Find the first quartile.
12. a. Lisa scored higher than 72% of her class. What would be her percentile?
b. Janice scored lower than 11% of her class. What would be her percentile?
13. What percentage of data will lie within 1.56 standard deviations of the mean? (Hint: Use Chebyshev’s Theorem)

2.5 Grouped Data (not in text)
Grouped data: Sometimes we may have to work with data in the form of a frequency distribution, called grouped data, when the raw data are not available. We do not have the data values used to make up this frequency distribution, so we are forced to approximate the values when calculating the mean and standard deviation. Let’s take the following data for example:

Class number    Class (age in years)    Frequency
1               20 and under 30         5
2               30 and under 40         14
3               40 and under 50         9
4               50 and under 60         6
5               60 and under 70         2
                                        = 36 in total

The approximation for the sample mean here is:
\(\bar{X} = \frac{\sum (f \cdot mid)}{N}\)
where f is the frequency of each class
where mid is the midpoint of each class
where N is the sample size (i.e. add up all the frequencies!)

So for our above example: \(\bar{X} = \frac{\sum (f \cdot mid)}{N}\)

Class    Frequency
20-30    5
30-40    14
40-50    9
50-60    6
60-70    2

The sample variance can be found using this formula:
\(\text{Sample Variance} = s^2 = \frac{\sum (f \cdot mid^2) - \frac{\left(\sum (f \cdot mid)\right)^2}{N}}{N - 1}\)

Class    Frequency
20-30    5
30-40    14
40-50    9
50-60    6
60-70    2

(OMIT 2.6)
2.6 Index Numbers
Index numbers: How many times have you heard a remark such as "Fifteen years ago we could have bought that house for $20,000 but now it’s $120,000"? To compare effectively the change in the price or value of a certain item between any two time periods, we use an index number. An index number (or index) measures the change in a particular item (or collection of items) between two time periods.
The formula for calculating index numbers:
\(\text{Index number} = \frac{x}{\text{base value}} \times 100\)
x: the number you want to change to an index number
base value: the number that represents the base rate

Example 1: The average hourly wages for production workers at Kessler Toy Company in 1975, 1980, 1985, and 1990 are shown below.

Year    Wage
1975    $6.40
1980    $7.05
1985    $8.50
1990    $10.90

If the base is chosen to be 1975, compute the index numbers for:
a. 1980
b. 1985
c. 1990

3.0 Elementary Probability
3.1 Introduction
3.1.1 Define the terms: probability, experiment, event, outcome, and sample space.
Probability: A probability is a measure of the likelihood that an event will occur when the experiment is performed.
Event: An event consists of one or more possible outcomes of an experiment.
Experiment: An activity for which the outcome is uncertain.
Outcome: An outcome is any particular result of an experiment.
Sample Space: The set of all possible outcomes of an experiment.
Examples:
Experiment: Rolling one die
Events:
A = rolling a one
B = rolling a two
C = rolling a three
D = rolling a four
E = rolling a five
F = rolling a six
Experiment: Taking a Statistics midterm
Events:
A = pass
B = fail

3.1.2 State and describe the methods for assigning probabilities
Methods of assigning probabilities:
1. Intuition: For example, the sports announcer claims that Shelia has a 90% chance of breaking the world record in the upcoming 100 meter dash. The statement was probably based on Shelia's past record and the announcer's confidence in her ability.
2. Relative frequency: For example, The Right to Health Lobby claims the probability is 40% of getting a wrong report from the medical laboratory. This probability comes from a sample of 100 reports in which 40 were wrong, so 40/100 = 0.40, and the probability of getting a wrong report is said to be 40% based on this information.
3.
Formula for equally likely outcomes:
\(\text{Probability of an event} = \frac{\text{number of outcomes favorable to the event}}{\text{total number of outcomes}}\)
Example: Let’s say we roll a die. There are 6 different outcomes we can have (1, 2, 3, 4, 5, 6), which make up its sample space. If we wanted to know the probability of rolling a six, we would note that there is only one outcome in which we could have a six. So using our formula for equally likely outcomes: Probability of getting a six = 1/6 = 0.167

3.1.3 Explain the use of the word probability
Probability represents the likelihood of an event occurring. The notation used is as follows:
The probability of event A happening: P(A)
If A = rolling an even number on a die, then P(A) = 3/6 = 0.5. This could also be written in long form instead of using the letter A, i.e. P(rolling an even number) = 0.5

Single event Probability Problems for Practice:
1. What is the probability of rolling a 4 on a die on a single roll? ________
2. What is the probability of rolling a 2 or a 5 on a die on a single roll? ________
3. What is the probability of rolling a 1, 2, 3, 4 or 5 on a die on a single roll? ________
4. What is the probability of drawing the queen of hearts from a deck of 52 cards on a single draw? ________
5. What is the probability of drawing a queen from a deck of cards on a single draw? ________
6. What is the probability of drawing a spade from a deck of cards on a single draw? ________

3.2 Counting Problems (pg 142-143: “Principles of Counting”)
3.2.1 State and use the formula to determine the number of possible outcomes of an experiment
Counting Problems: Fundamental Counting Principle: Counting rules determine the number of possible outcomes that exist for a certain broad range of experiments. They can be extremely useful in determining probabilities. The question we wish to answer here is, for a particular experiment, how many possible outcomes are there?
We use the following rule:
\(n_1\) = the number of ways of filling the first slot
\(n_2\) = the number of ways of filling the second slot after the first slot is filled
\(n_3\) = the number of ways of filling the third slot after the first two slots are filled
...
\(n_k\) = the number of ways of filling the kth slot after filling slots 1 through (k - 1)
\(\text{number of possible outcomes} = n_1 \times n_2 \times n_3 \times \cdots \times n_k\)

Let’s look at the following examples:
1. John has 4 shirts and 2 pairs of pants. How many different outfits can he make?
2. A restaurant serves 3 items for breakfast and 5 different drinks. How many different pairings can be made?
3. Kelly has 6 different purses, 3 different pairs of shoes, and 2 belts. How many combos are there?
4. When ordering a new car, you have a choice of eight interior colors, ten exterior colors, and four roof colors. How many possible color schemes are there?

Worksheet #4
1. Use the frequency distribution below to find the mean.

Cost ($)    Frequency
15-25       2
25-35       5
35-45       0
45-55       7

2. The cost of a Nike jacket was $85 five years ago, and today it is $120. If you use the cost five years ago as the base rate, what is the index number for the cost of the jacket today?
3. There are 10 balls in a bag. Three are red and seven are blue.
a. What is the probability of picking a red ball?
b. What is the probability of picking a blue ball?
4. You have a standard deck of cards. Determine the following probabilities.
a. picking the ace of spades
b. picking an ace
c. picking a red ace
d. picking a red card
5. A restaurant has three types of drinks, five breakfast options, nine lunch options, and two dessert options. How many combinations of dinners are possible?

Note: For the next 2 sections (Permutations and Combinations) you will be determining the number of ways you can “PICK” or “CHOOSE” a certain number of things from a bigger number of things. So be on the look-out for these key words in problems.
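Before moving on to permutations, the fundamental counting principle from section 3.2.1 can be sketched in Python. The lunch menu below is hypothetical and the function name `count_outcomes` is made up for illustration; the brute-force listing with `itertools.product` is only there to confirm that multiplying the slot sizes gives the same count.

```python
from itertools import product
from math import prod


def count_outcomes(*slot_sizes):
    """Fundamental counting principle: n1 x n2 x ... x nk."""
    return prod(slot_sizes)


# Hypothetical lunch menu, used only to illustrate the rule.
sandwiches = ["ham", "turkey", "veggie"]
sides = ["soup", "salad"]
drinks = ["water", "milk", "juice", "tea"]

by_formula = count_outcomes(len(sandwiches), len(sides), len(drinks))
by_listing = len(list(product(sandwiches, sides, drinks)))

print(by_formula)  # 24
print(by_listing)  # 24
```

Listing every outcome quickly becomes impractical as the number of slots grows, which is the reason for using the multiplication rule instead.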
3.3 Permutations (pg 143)
3.3.1 Define Permutation
A permutation is any arrangement of r objects selected from a single group of n possible objects.

3.3.2 Analyze permutations of a set of data.
For permutations, the order is important, as you will see in the following example! Say for example you have three people (Tim, Sarah, and John) and you want to pick 2 of them. These are the different permutations you could come up with:
Tim, Sarah
Tim, John
Sarah, Tim
Sarah, John
John, Sarah
John, Tim
NOTE! The orders “Tim, Sarah” and “Sarah, Tim” are counted as 2 separate permutations! (In the next section on “Combinations”, they will only be counted as 1.) So it is important for you to ask yourself if the order is important. If it is, then AB and BA are counted as two separate possibilities.
Example: There are 4 people at an event. 2 people must be picked from the 4 to win two different door prizes (first place and second place). How many ways can the prizes be distributed?

3.3.3 Define the factorial function.
A factorial is represented by an exclamation mark. It is defined as:
\(n! = n(n - 1)(n - 2) \cdots 3 \cdot 2 \cdot 1\)
Examples: Calculate the following factorials.
a. 5!
b. 7!
c. \(\frac{6!}{3!}\)

3.3.4 Calculate permutations using the factorial function.
There is a formula to quickly determine the number of permutations, rather than writing out all the arrangements like we did in 3.3.2:
\({}_nP_r = \frac{n!}{(n - r)!}\)
where:
n represents the total number of objects
r represents the number of objects to be selected
Example #1: Try to do the previous example, which we worked out the long way, using this formula. We had to pick 2 people out of 4 and the order mattered.
REMEMBER: This formula is used only when the order is important (i.e. when “Tim, Lisa” and “Lisa, Tim” count as two different answers).
Example #2:
a. Calculate \({}_8P_3\)
b. You are having a dinner party and want to arrange one side of a table. There are 6 chairs, but there are 9 people.
How many different arrangements are possible if order matters?
c. You are developing a quiz. You can pick 4 questions from a list of 9 possible questions. If the ordering of the questions matters, how many possible arrangements are there?
d. You need to make up a four-digit student number. How many possible arrangements are there to give to students? (Remember there are 10 possible digits to choose from.)

3.4 Combinations
3.4.1 Define combination (pg 145)
If the order of the selected objects is NOT important, the selection is called a combination.

3.4.2 Analyze the number of combinations possible
To illustrate the idea of combinations, we will first do a problem the long way. It will help illustrate the difference between a permutation and a combination.
Example: There are 3 people (Tim, John, and Sarah). You want to make a group of two students. How many different groups of 2 are there?
Tim, Sarah
Tim, John
Sarah, Tim
Sarah, John
John, Sarah
John, Tim
BUT, notice that a group with Tim and Sarah in it is THE SAME as a group with Sarah and Tim in it. So there are only 3 groups. Hence when the order is not important, as above, there are fewer combinations than permutations.

3.4.3 Calculate combinations using the factorial formula.
\({}_nC_r = \frac{n!}{r!(n - r)!}\)
where:
n represents the total number of objects
r represents the number of objects to be selected
Examples:
1. \({}_3C_2\) can be used to answer the example above.
2. You need to pick 5 colours from a list of 8 colours. How many different groups of 5 are there if the order of the colours is not important?
3. You need to pick 3 people for a committee from a list of 9. How many different groupings are possible if the ordering does not matter?
4. You are picking 4 lotto numbers from a bag containing 10 numbers. How many different arrangements are possible?
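The difference between permutations and combinations can be checked with a short Python sketch using the Tim, Sarah, and John example from sections 3.3.2 and 3.4.2. The function names `n_p_r` and `n_c_r` are made up for this illustration; they simply apply the two factorial formulas, and the listings from `itertools` confirm the counts.

```python
from itertools import combinations, permutations
from math import factorial


def n_p_r(n, r):
    """Number of permutations: n! / (n - r)!  (order matters)."""
    return factorial(n) // factorial(n - r)


def n_c_r(n, r):
    """Number of combinations: n! / (r! (n - r)!)  (order does not matter)."""
    return factorial(n) // (factorial(r) * factorial(n - r))


people = ["Tim", "Sarah", "John"]

ordered_pairs = list(permutations(people, 2))  # ("Tim", "Sarah") and ("Sarah", "Tim") both appear
groups = list(combinations(people, 2))         # each pair of people appears once

print(len(ordered_pairs), n_p_r(3, 2))  # 6 6
print(len(groups), n_c_r(3, 2))         # 3 3
```

Each group of 2 can be ordered in 2! = 2 ways, which is why the permutation count is exactly 2! times the combination count here.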
(OMIT 3.4.2)
3.4.2 Use Pascal’s Triangle to determine the number of combinations of sets in a set of data
Pascal's Triangle for Combinations:

Row 0:   1
Row 1:   1 1
Row 2:   1 2 1
Row 3:   1 3 3 1
Row 4:   1 4 6 4 1
Row 5:   1 5 10 10 5 1
Row 6:   1 6 15 20 15 6 1
Row 7:   1 7 21 35 35 21 7 1
Row 8:   1 8 28 56 70 56 28 8 1
Row 9:   1 9 36 84 126 126 84 36 9 1
Row 10:  1 10 45 120 210 252 210 120 45 10 1
...

To create this triangle, we start with the first two rows, 1 and 1 1. Each following row has ones on the outside, and every other entry is the sum of the two numbers directly above it. For example, the 2 in Row 2 comes from adding 1 + 1, and so on.
Pascal's triangle is another way of finding the number of combinations for a particular situation. Let's say you have five hats on a rack, and you want to know how many different ways you can pick two of them to wear. It doesn't matter to you which hat is on top, so the order of selection does not matter (this is a combination). To solve this problem we could use our combination formula:
\({}_5C_2 = \frac{5!}{2!(5 - 2)!}\)
or we could use Pascal's Triangle, in which we look at the fifth row, second number, which is 10.
Note: This triangle starts at Row 0. In other words, the 1 at the top is considered to be Row 0, then the next row is Row 1, and so on. The first number in every row is numbered 0 as well.
Let’s say we had 10 hats on the rack, and you want to know how many different ways you can pick 5 of them. Then we could use our formula and get \({}_{10}C_5\), or we can look at the 10th row, 5th number, which is 252.
(recommended problems: pg 146 #33, 34, 35, 38, 39, 40)

3.6 Addition Rules
3.6.2/3.6.3 Define mutually exclusive events and be able to determine whether events are mutually exclusive or not.
Mutually Exclusive: The occurrence of one event means that none of the other events can occur at the same time. (i.e. they can’t be both!)
The events “is male” and “is female” are mutually exclusive since they can’t occur at the same time. (i.e.
you can’t be male and female) Similarly, if you flip a coin once, the events “is heads” and “is tails” are mutually exclusive since a single flip can’t be both heads and tails.
Examples: Determine whether the following are mutually exclusive or not. (hint: ask yourself “Is it possible to be both at the same time?” If yes, then they aren’t mutually exclusive, and if no, they are mutually exclusive.)
1. Event A: a number’s first digit is 2
   Event B: a number’s second digit is 6
2. Event A: a number’s first digit is 1
   Event B: a number’s first digit is 3
3. Event A: is taller than 5 feet
   Event B: is shorter than 5 feet
4. Event A: has visited Florida
   Event B: has visited France
5. Event A: likes strawberries
   Event B: likes blueberries
6. Event A: likes strawberries
   Event B: hates strawberries
7. Event A: is a king (in a deck of cards)
   Event B: is a heart (in a deck of cards)
8. Event A: is a heart
   Event B: is a spade
9. Event A: got an “A” on the first test
   Event B: got a “B” on the first test
10. Event A: has brown hair
    Event B: has blue eyes

3.6.1 Use the addition rules to determine the probability of an event (pg 131)
P(A or B) = P(A) + P(B) – P(A and B)
Examples:
1. What is the probability that a card chosen at random from a standard deck of cards will be either a king OR a heart? (there are 4 kings, 13 hearts and 1 king of hearts)
2. A bag contains 10 balls.
3 balls: white
5 balls: red
2 balls: yellow
You must pick one ball from the bag.
a. What is the probability that the ball you pick is white?
b. What is the probability that the ball is white or red?
3. The data below is gathered from a class of 30 students.

Sport played           Frequency
softball               10
hockey                 20
softball and hockey    5

a. What is the probability that a student plays softball?
b. What is the probability that a student plays softball or hockey?
4. What is the probability of rolling a 3 or a 5 on a single roll of a die?
5.
The hourly wages of a group of 30 students who each have one part-time job are shown.

Hourly wage ($)    Frequency
7                  20
8                  6
9                  4

a. What is the probability that a student’s wage is $7?
b. What is the probability that a student’s wage is $7 or $9?

Worksheet for 3.3
1. A bag contains 10 different candies. How many ways can you choose 5 candies from the bag? (order doesn’t matter)
2. You are buying a bed and there are 10 different frames you can choose from and 4 types of bedspreads. How many combinations of frames and bedspreads can you have?
3. a. An area has eight plots of land. Five families want to move into the area. How many ways can you arrange the five families?
b. Five people are being chosen from a group of eight to go on a trip. How many ways can you pick five people?
4. A flight attendant has 11 seats available and must seat 7 passengers. How many ways are there to arrange the passengers?
5. a. There are seven aisle seats at a concert and there are five people that will be picked for these seats. How many ways can the people be arranged?
b. Nine concert-goers are going to be selected from a group of eleven to meet the band. How many ways can they be picked?
c. Two out of eight dogs are going to be awarded first and second prize. How many ways can the prizes be awarded?
6. State whether each of the following pairs of events is mutually exclusive or not.
a. Event A: born in 1980
   Event B: born in 1982
b. Event A: born in 1980
   Event B: born in Canada
c. Event A: has type A blood
   Event B: has type O blood
d. Event A: has two children
   Event B: has no children
e. Event A: likes dogs
   Event B: likes cats
7. There are 28 students in a class. Eight of them are nine years old and the rest are ten. One student is selected.
a. What is the probability of selecting a nine year old?
b. What is the probability of selecting a nine OR a ten year old?
8. Use the table below to answer the following questions. A class of 30 students was surveyed for the following information.
   Video system   frequency
   X-Box 360      10
   PS3            7
   both           3
   none           16
   If one student is selected from the class:
   a. What is the probability of the student having an X-Box 360?
   b. What is the probability that they will have an X-Box 360 OR a PS3?

3.7 Conditional Probability
3.7.1 Define the term conditional probability
Conditional probability is the probability of a particular event occurring, given that another event has occurred.

3.7.4 Define independent events
Two events are independent if the occurrence of one event has no effect on the probability of the occurrence of another event.

Example #1: two independent events: flipping a coin twice
If you flip a coin twice, the result of the first flip has NO EFFECT on the result of the second flip. So, the result of the first flip is INDEPENDENT of the result of the second flip.

Example #2: two independent events: picking cards (and replacing the picks)
If you pick a card from a deck and then put this card back before you pick the next card, then the result of the first pick has NO effect on the result of the second pick. So the two picks are INDEPENDENT events.
Note: If you had not put your first pick back in the deck before you picked again, then it would affect the result of the second pick. This is because the second pick could be only one of 51 options, rather than 52. In this case, the events would NOT be independent.

Example #3: two independent events: picking two balls out of a bag
You have a bag of 10 balls of various colours. If you pick a ball and then put it back before you pick another ball, then the result of the first pick has NO EFFECT on the result of the second pick. Hence, the events are INDEPENDENT.
Note: If you did not put the first pick back, then the events would not be independent, since the second pick would be one of 9 options, not 10.

Helpful Hint: When you replace your first pick, the events are independent, but if you don’t replace it, then they are NOT independent.
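The helpful hint can be checked with a few lines of Python. This is only a sketch under stated assumptions: the notes say the bag holds 10 balls of "various colours", so the split into 4 red and 6 white is an illustrative choice, and the function name second_pick_probability is made up for this example.

```python
from fractions import Fraction

def second_pick_probability(counts, first, second, replace):
    """Probability that the second pick is `second`, given the first pick was `first`.

    counts  -- dict mapping colour to number of balls (assumed bag contents)
    replace -- True if the first ball is put back before the second pick
    """
    remaining = dict(counts)      # copy so the original bag is not changed
    if not replace:
        remaining[first] -= 1     # the first ball stays out of the bag
    total = sum(remaining.values())
    return Fraction(remaining[second], total)

# Assumed bag: 4 red and 6 white balls (10 in total)
bag = {"red": 4, "white": 6}

# Chance of red on a pick when nothing has been removed
unconditional = Fraction(bag["red"], sum(bag.values()))

# Chance of red on the second pick, given the first pick was red
with_replacement = second_pick_probability(bag, "red", "red", replace=True)
without_replacement = second_pick_probability(bag, "red", "red", replace=False)

print(unconditional, with_replacement, without_replacement)
```

With replacement, the chance of red on the second pick equals the chance of red on any single pick, so the first pick had no effect (independent). Without replacement the chance changes, because only 9 balls are left, so the picks are NOT independent.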
3.7.5 Determine whether the following events are independent.
1. Event A: picking a card and NOT putting it back; Event B: picking another card
2. Event A: rolling a die; Event B: rolling the die again
3. Event A: flipping a coin; Event B: flipping it again
4. Event A: picking a ball from a bag and putting it back; Event B: picking another ball from the bag
5. Event A: picking a ball from a bag and NOT putting it back; Event B: picking another ball from the bag

3.8 Multiplication Rules
3.8.1 Use the “special rule” for multiplication
The “special rule” for multiplication applies to events that are independent of each other. The multiplication rule is used for “AND” problems. (Note: We did “OR” problems yesterday.) Up until now, we only picked a card ONCE, or rolled a die ONCE, or flipped a coin ONCE. NOW, we are doing TWO or more things IN A ROW.
The “special rule” for independent events is:
P(A and B) = P(A) × P(B)
Or more generally, for more than two independent events:
P(A and B and C and D ... etc.) = P(A) × P(B) × P(C) × P(D) ... etc.
Remember: This rule only applies to INDEPENDENT events!
Examples:
1. What is the probability of flipping a coin and getting a tail AND then flipping it again and getting a head?
   Since the events are independent we can use:
2. What is the probability of rolling a 4 and then rolling the die again and getting a 2?
   Since the events are independent we can use:
3. There are 10 balls in a bag. Four are red and 6 are white.
   a. What is the probability of picking 2 white balls in a row if you put back your first pick?
      Since the events are independent (because we put our first pick back) we can use:
   b. What is the probability of picking a red ball, then a white ball, and then another white ball if you put each ball back into the bag after you pick it?
      Since the events are independent (because we keep putting our picks back) we can use:
   c. What is the probability of picking 3 red balls in a row if you put each pick back each time?
      Since the events are independent (because we keep putting our picks back) we can use:
4. What is the probability of picking a red card and then a jack if you replace your first pick? (26 red cards, 4 jacks, 52 cards in a deck)

3.8.2 Use the “general” multiplication rule to compute probabilities.
This rule can be used on “AND” problems (i.e. events happening in a row) when the events aren’t independent! (This rule always works, even for independent events.)
P(A and B) = P(A) × P(B given that A happened)
or more generally, for more than two events:
P(A and B and C and D) = P(A) × P(B given that A happened) × P(C given that A and B happened) × P(D given that A, B and C happened)
Examples:
1. There are 10 balls in a bag. Four are red and six are white.
   a. What is the probability of picking a red ball and then picking another red ball if you don’t put back your first pick?
   b. What is the probability of picking a red ball and then picking a white ball if you don’t put your first pick back?
   c. What is the probability of picking a white ball and then a red ball and then a white ball if you don’t put any of your picks back?
   d. What is the probability of picking 4 white balls in a row if you don’t put any picks back?
2. a. What is the probability of picking a heart and then a diamond if you don’t put your first pick back? (a 52 card deck has 13 hearts and 13 diamonds)
   b. What is the probability of picking a jack and then another jack if you don’t put your first pick back? (a 52 card deck has 4 jacks)
   c. What is the probability of picking a red card and then a black card and then another red card if you don’t put your picks back? (a 52 card deck has 26 red cards and 26 black cards)
3. There are 10 rolls of film in a box, three of which are defective. (the rest aren’t)
   a. What is the probability of picking a defective roll and then picking another defective roll if you don’t put your first pick back?
   b.
What is the probability of picking a defective roll and then picking another defective roll if you put your first pick back?
   c. What is the probability of picking a non-defective roll and then a defective roll if you do not put your first pick back?
   d. What is the probability of picking 3 non-defective rolls in a row if you put your pick back in the box each time?
   e. What is the probability in part d if you do not put your pick back each time?
   f. What is the probability of picking a defective roll, then another defective roll, and then a non-defective roll if you replace your pick each time?
   g. What would be your answer to part f if you did not replace your pick each time?
   h. What is the probability of getting three defective rolls in a row if you do not replace your picks?

3.9 Bayes’ Theorem
3.9.3 Use Bayes’ Theorem to calculate probabilities
To solve Bayes’ Theorem problems, it is very helpful to use a Tree Diagram. In Bayes’ Theorem problems, you will divide your data into groups more than once. In probability theory, Bayes’ Theorem is a means of revising probabilities in light of relevant evidence; it is an application of conditional probability. The following examples will help illustrate.
Examples:
1. A class consists of 60% men and 40% women. Of the men, 25% passed the test, while 45% of the women passed.
   a. Illustrate the information with a tree diagram.
   b. What is the probability that if a student chosen at random passed the test, then they are male?
   c. What is the probability that if a student chosen at random failed the test, then they are male?
   d. What is the probability that if a student chosen at random passed the test, then they are female?
2. There are two different suppliers of a particular part. Company A supplies 80% of the parts and company B supplies 20%. 5% of the parts supplied by company A and 9% of the parts supplied by B are defective.
   a. Construct a tree diagram to represent the information.
   b.
Suppose you reach into the bin and select a part and find it is non-defective. What is the probability that it was supplied by A?
   c. Suppose you reach in and select a part and find that it is defective. What is the probability that it was supplied by B?
   d. Suppose that the part you pick was defective. What is the probability that it was supplied by A?
   e. Suppose that the part you pick was non-defective. What is the probability that it was supplied by B?

Math Worksheet for 3.7
1. There are 50 balls in a bag. 20 are blue, 10 are white, 15 are red, and 5 are green.
   a. What is the probability that you pick a blue ball and then put it back and pick a red ball?
   b. What is the probability that you pick a blue ball and then a red ball if you don’t put your first pick back?
   c. What is the probability of picking a green ball and then another green ball if you don’t put your first pick back?
   d. What is the probability of picking a red ball, and then a green ball if you put your first pick back?
   e. What is the probability of picking a blue ball and then another blue ball if you put your first pick back?
2. What is the probability of flipping a tail and then another tail and then a head?
3. There are 27 toys in a box. 11 of them are defective, and the rest are non-defective.
   a. What is the probability that you pick a defective toy and then a non-defective toy if you do not replace the first toy?
   b. What is the probability that you pick a non-defective toy and then another non-defective toy if you replace your first pick?
   c. What is the probability that you pick a defective toy and then another defective toy if you do not replace the first pick?
   d. What is the probability of picking 3 non-defective toys in a row if you replace your picks?
   e. What is the probability of picking 3 defective toys in a row if you do not replace your picks?
4. What is the probability of flipping 5 heads in a row?
5. In a population, 70% of the people have a certain disease.
A test is developed that has a 40% chance of detecting the condition in a person who has it, and a 10% chance of falsely indicating it in a person who does not have it. If a person gets a positive test result, what is the probability they have the condition? (a tree diagram will help)
6. A class consists of 75% men and 25% women. Of the men, 60% passed the test, and of the women, 65% passed the test.
   a. Construct a tree diagram to represent this data.
   b. If a person passed the test, what is the probability that they are male?
   c. If a person failed the test, what is the probability that they are female?

Test #1 up to end of Unit 3!

4.0 Discrete Probability Distributions
4.1 Introduction
4.1.1 Discuss the concept of a random variable
4.1.2 Explain the difference between a discrete random variable and a continuous random variable
When the random variable can take on only a finite number of values or a countable number of values, we say that the variable is a discrete random variable. A discrete random variable can assume only certain clearly separated values resulting from a count of some item of interest. Example: the number of highway deaths in Ontario on the Labour Day weekend may be 5, 6, 7 ... etc. (not 5.3). This does not mean that a discrete random variable cannot assume fractional values, but it does mean that there is some distance between the values. The result is still a count - example: 12 stocks increased by $0.25 or 1/4.
When the random variable can take on any number on the number line - not just counting numbers - we say that the variable is a continuous random variable. We can have 2.5 or 1.8 mm of rainfall, for example, or in a high school track meet, the winning time for the mile run may be reported as 4 minutes 20 seconds; 4 minutes 20.2 seconds; or 4 minutes 20.3416 seconds, etc.
NOTE: If the problem involves counting something, the resulting distribution is usually a discrete probability distribution.
If the distribution is the result of a measurement, then it is usually a continuous probability distribution.

Example: State whether the following random variables are discrete or continuous.
1. the number of letters delivered on time: ___________________
2. the length of a package: ___________________
3. the number who attended: ___________________
4. the height of each student: ___________________

4.1.3 Show how probability functions assign probabilities to different values of random variables.
The following is a probability function that assigns probabilities to a random variable.

4.2 The mean of a random variable
4.2.1 Show how the mean of a probability function helps to give the expected value of a probability function.
Special Note: The mean of a probability distribution is often referred to as the "expected value" of the distribution.
µ = Σ[x × P(x)]
This formula directs you to multiply each outcome (x) by its probability P(x), and then add the products.

Example 1: Calculate the mean of the following probability function representing the probabilities of batters getting a given number of strikes.
   x   P(x)
   0   0.2
   1   0.35
   2   0.15
   3   0.3

4.3 Measuring Chance Variation
4.3.1 Show how the variance of a probability function helps to give the spread of a probability function.
The standard deviation of a discrete probability distribution is found by taking the square root of the variance σ², thus σ = √(σ²).
Variance of a probability distribution:
σ² = Σ[(x − µ)² × P(x)]

Example 1: Find the variance of the probability distribution given in 4.2.1.
   x   P(x)
   0   0.2
   1   0.35
   2   0.15
   3   0.3

Example 2: Find the following using the population distribution below.
   x   P(x)
   0   0.1
   1   0.15
   2   0.2
   3   0.4
   4   0.15
a. What is the probability of exactly 2? __________
b. What is the probability of exactly none? ___________
c. What is the probability of at least 2? _______________
d. What is the probability of more than 3? _________________
e. What is the probability of no more than 1? _________________
f.
Calculate the mean.
g. Calculate the variance.

4.4 Binomial Distribution
4.4.1 Describe a binomial experiment.
One of the most widely used discrete probability distributions is the binomial probability distribution.
Illustrations of each characteristic:
1. An outcome is classified as either a "success" or "failure." For example, 40 percent of the students at a particular university are enrolled in the Business program. For a selected student there are only two possible outcomes - the student is enrolled in the program (a success) or is not enrolled in the Business program (failure).
2. The binomial distribution is the result of counting the number of successes in a fixed sample size. If we select 5 students, 0, 1, 2, 3, 4 or 5 could be enrolled in the Business program. This rules out the possibility of 3.75 of the students being enrolled. There cannot be fractional counts.
3. Probability of success stays the same - in this example the probability of a success remains at 40% for all five students selected.
4. Trials are independent - this means that if the first student selected is enrolled in the Business program, it has no effect on whether the second or fourth one selected will be in the Business program.

(OMIT 4.4.2)
4.4.2 State and graph the formula for a binomial experiment.
There will be TWO ways to get the probability values for a binomial experiment. The first way is to use the formula below and the second way is to use a table of values and look up your answer. The table saves a LOT of time, but be careful to use the right method. If you are asked to use the formula to come up with a probability, then use the following method.
To construct a binomial distribution, we need to know:
1. the number of trials - designated n
2. the probability of success on each trial - designated π
P(x) = nCx × π^x × (1 − π)^(n − x)
Where: nCx denotes a combination (i.e.
use the Combination formula)
n is the number of trials
x is the number of observed successes
π is the probability of success on each trial

Examples:
1. There are five flights daily from Halifax. Suppose the probability that any flight arrives late is 0.20. Use the binomial formula to determine the following probabilities.
   a. What is the probability that none of the flights are late today?
   b. What is the probability that exactly one of the five flights will arrive late today?
   c. What is the probability that exactly three of the five flights will arrive late today?

(OMIT)
4.4.3 Solve Binomial Distribution Problems using the binomial distribution tables.
Unless it states that you must use the binomial distribution formula, USE THE TABLES! They are much easier and quicker to use. There are a bunch of tables. To figure out which one you need, you have to know your n value, which represents the number of trials. You also need to know your π value (your probability of success on each trial).
Examples:
1. Five percent of the gears produced by a machine are defective.
   a. What is the probability that out of 6 gears selected at random, none will be defective? ________
   b. What is the probability that out of the 6 selected at random, exactly one will be defective? ________
   c. What is the probability that of the 6, two will be defective? ________
   d. What is the probability that of the 6, three will be defective? ________
   e. What is the probability that of the 6, four will be defective? ________
   f. What is the probability that of the 6, five will be defective? ________
   g. What is the probability that of the 6, all will be defective? ________
   h. With the information from a-g, graph a binomial probability distribution.
   *i. What is the probability that less than 3 will be defective? ________________________________
   *j. What is the probability that at least 3 will be defective? ________________________________
   *k.
What is the probability that more than 3 will be defective? ________________________________
2. Eighty percent of employees use direct deposit. Suppose we select a random sample of 7 employees and count the number using direct deposit. What is the probability that:
   a. none use direct deposit? ____________
   b. exactly 3 use direct deposit? ____________
   c. exactly 4 use direct deposit? ____________
3. The postal service reports ninety percent of its first class mail delivered within 2 days. If 5 letters are randomly selected, what is the probability that:
   a. exactly one arrives within 2 days? ____________
   b. exactly 3 arrive within 2 days? ____________
   c. all arrive within 2 days? ____________

4.5 The mean and standard deviation of the binomial distribution
4.5.1/4.5.2 Calculate the mean and standard deviation of a binomial distribution.
The mean of a probability distribution is often called the expected value of the distribution. This terminology reflects the idea that the mean represents a "central point" or "cluster point" for the entire distribution.
The mean and variance of a binomial distribution can be computed by these formulas:
µ = nπ
σ² = nπ(1 − π)
Examples: Calculate the mean, variance, and standard deviation for the binomial distribution problems #1-3 above.
For #1:
For #2:
For #3:

Worksheet for 4.0
1. Compute the mean and variance of the following discrete probability distribution.
   x      0     1     2     3
   P(x)   0.2   0.4   0.3   0.1
2. Compute the mean and variance of the following discrete probability distribution.
   x      2     8     10
   P(x)   0.5   0.3   0.2
3. Which of the following variables are discrete and which are continuous random variables?
   a. The number of new accounts established by a salesperson in a year.
   b. The time between customer arrivals to a bank ATM.
   c. The number of customers in Big Nick’s barber shop.
   d. The amount of fuel in your car’s gas tank last week.
   e. The number of minorities on a jury.
   f. The outside temperature today.
4.
Dan Woodward is the owner and manager of Dan’s Truck Stop. Dan offers free refills on all coffee orders. He gathered the following information on coffee refills. Compute the mean, variance, and standard deviation for the distribution of number of refills.
   x         0    1    2    3
   Percent   30   40   20   10
5. The director of admissions at a university estimated the distribution of student admissions for the fall semester on the basis of past experience. What is the expected number (i.e. what is the mean) of admissions for the fall semester? Compute the variance and the standard deviation of the number of admissions.
   Admissions   Probability
   1000         0.6
   1200         0.3
   1500         0.1
6. The following table lists the probability distribution for cash prizes in a lottery conducted at Lawson’s Department Store.
   x      0      10     100    500
   P(x)   0.45   0.30   0.20   0.05
   If you buy a single ticket, what is the probability that you win:
   a. exactly $100?
   b. at least $10?
   c. no more than $100?
   d. Compute the mean, variance, and standard deviation of this distribution.
7. You are asked to match three songs with the performers who made those songs famous. If you guess, the probability distribution for the number of correct matches is:
   x (the number correct)   0       1       2   3
   P(x)                     0.333   0.500   0   0.167
   What is the probability that you get:
   a. exactly one correct?
   b. at least one correct?
   c. exactly two correct?
   d. Compute the mean, variance, and standard deviation of this distribution.
(Omit) 8. In a binomial situation n=4 and π=0.25. Determine the probabilities of the following events using the binomial formula.
   a. x=2
   b. x=3
(Omit) 9. In a binomial situation n=5 and π=0.40. Determine the probabilities of the following events using the binomial formula.
   a. x=1
   b. x=2
10. Assume a binomial distribution where n=3 and π=0.60.
   (Omit) a. Refer to Appendix A, and list the probabilities for values of x from 0 to 3.
   b. Determine the mean and standard deviation of the distribution.
11. Assume a binomial distribution where n=5 and π=0.30.
   (Omit) a.
Refer to Appendix A, and list the probabilities for values of x from 0 to 5.
   b. Determine the mean and standard deviation of the distribution.
(Omit) 12. A society of investors’ survey found 30 percent of individual investors have used a discount broker. In a random sample of nine individuals, what is the probability:
   a. exactly two of the sampled individuals have used a discount broker?
   b. exactly four of them have used a discount broker?
   c. none of them have used a discount broker?
13. The postal service reports 95 percent of first class mail within the same city is delivered within two days of the time of mailing. Six letters are randomly sent to different locations.
   (Omit) a. What is the probability that all six arrive within two days?
   (Omit) b. What is the probability that exactly five arrive within two days?
   c. Find the mean number of letters that will arrive within two days.
   d. Compute the variance and standard deviation of the number that will arrive within two days.
14. The industry standards suggest that 10 percent of new vehicles require warranty service within the first year. Jones sold 12 Nissans yesterday.
   (Omit) a. What is the probability that none of these vehicles requires warranty service?
   (Omit) b. What is the probability that exactly one of these vehicles requires warranty service?
   (Omit) c. Determine the probability that exactly two of these vehicles require warranty service.
   d. Compute the mean and standard deviation of this probability distribution.

5.0 Continuous Probability Distributions
5.1 Introduction
5.1.1 Name and sketch the graph of the various probability distributions
The following are examples of some of the different types of probability distributions.
Continuous Probability Distributions:
1) The normal distribution: (This is the one we will be dealing with)
2) The uniform distribution:
3) The exponential distribution:

5.2.1/5.2.2 Discuss and sketch the normal distribution and state the properties of the normal distribution
The Normal Distribution Curve: (Bell Curve)
A population which is normally distributed will have a mean located at the center and a curve that is symmetrical (which means that each side is a reflection of the other).
The percentages given below stem from the Empirical Rule, which states that about 68.27% of the data lie within one standard deviation of the mean, about 95.45% of the data lie within two standard deviations of the mean, and about 99.73% of the data lie within three standard deviations of the mean.
The normal distribution is a continuous probability distribution that is uniquely determined by its mean (µ) and standard deviation (σ).

Characteristics of a Normal Probability Distribution
A normal distribution is completely described by its mean and standard deviation. This indicates that if the mean and standard deviation are known, a normal distribution can be constructed and its curve drawn.
The following chart shows three normal distributions, where the means are the same but the standard deviations are different.
The following chart shows three distributions with different means but identical standard deviations.
The following chart shows three distributions with different means and different standard deviations.

Examples:
1. A company conducts a test on the lifespan of a battery. For a particular battery, the mean life is 19 hours. The useful life of the battery follows a normal distribution with a standard deviation of 1.2 hours. Answer the following questions.
   a. About 68% of the batteries will have a lifespan between what two values?
   b. About 95% of the batteries will have a lifespan between what two values?
   c.
Virtually all of the batteries will have a lifespan between what two values?
2. The mean of a normal probability distribution is 250; the standard deviation is 20.
   a. About 95 percent of the observations lie between what values?
   b. About what percent of the data lies between 230 and 270?
   c. About what percent of the data lies between 190 and 310?

5.3 The Standard Normal Curve
2.4.4.4 and 2.4.4.5 Define and compute z-scores. (new numbers on outline 2.3.3)
The Standard Normal Distribution: This is a normal distribution curve, but in a standard normal distribution the mean is zero and the standard deviation is 1.
Remember: the standard deviation is a measurement of how much the values typically deviate from the mean.
The standard normal distribution can be used for all problems where the normal distribution is applicable. Any normal distribution can be converted into the "standard normal distribution" by using a z value. The z value measures the distance between a particular value of X and the mean in units of the standard deviation.
This is how to compute a z-score:
z = (X − µ) / σ
where:
X: value of your random variable
µ: the mean of the distribution of the random variable
σ: the standard deviation of the distribution
(So, looking at the formula you can see that the z-score measures how many standard deviations a number is from the mean.)
The following illustration demonstrates converting an X value to a standardized z value:

Examples:
1. A distribution has a mean of 100 and a standard deviation of 10. Calculate the z-scores of each of the following.
   a. 110
   b. 80
2. The weekly incomes of shift foremen in the glass industry are normally distributed with a mean of $1000 and a standard deviation of $100. What is the z-value for a foreman who earns:
   a. $1150 per week?
   b. $925 per week?

5.3.1 Use the standard normal curve to calculate probabilities.
You will be using another table (Appendix D) to calculate probabilities using the standard normal curve.
It contains a list of z-scores.
Obtaining the Probability: (2 STEPS!)
1. To obtain the probability of a value falling in the interval between the variable of interest (X) and the mean, we first compute the z-score.
2. To obtain the probability we refer to the Standard Normal Probability Table (Appendix D) for the associated probability of a given area under the curve.
The following is an illustration of how we read the Standard Normal Probability Table for z = 0.12. (SKETCHES HELP!!!)
Note: The table gives the probability for the area under the curve from the mean to the z-value. And remember the distribution is symmetrical (50% of values are on the right and 50% are on the left).
Examples:
1. A normal population has a mean of 1000 and a standard deviation of 100.
   a. Compute the z-value associated with 1000. (although you don’t really need to for this one since it’s the mean!)
   b. Compute the z-value associated with 1100.
   c. What is the probability of selecting a value between 1000 and 1100?
   d. What is the probability of selecting a value that is less than 1100?
   e. What is the probability of selecting a value that is greater than 1100?
2. A recent study of the hourly wages of a group of employees showed that the mean hourly salary was $20.50, with a standard deviation of $3.50. If we select a crew member at random, what is the probability the crew member earns:
   a. between $20.50 and $24.50 per hour?
   b. more than $24.50 per hour?
   c. less than $24.50 per hour?
   d. less than $19.00 per hour?
   e. more than $19.00 per hour?
   *f. between $19.00 and $24.50?
   ***g. between $22.50 and $24.50?

Worksheet for 5.0
1. The mean of a normal probability distribution is 500; the standard deviation is 10.
   a. About 68 percent of the observations lie between what values?
   b. About 95 percent of the observations lie between what two values?
   c. Practically all of the observations lie between what two values?
2.
The mean of a normal probability distribution is 60; the standard deviation is 5.
   a. About what percent of the observations lie between 55 and 65?
   b. About what percent of the observations lie between 50 and 70?
   c. About what percent of the observations lie between 45 and 75?
3. The Kamp family has twins, Rob and Rachel. Both Rob and Rachel graduated college 2 years ago, and each is now earning $50 000 per year. Rachel works in the retail industry, where the mean salary for executives with less than 5 years’ experience is $35 000 with a standard deviation of $8 000. Rob is an engineer. The mean salary for engineers with less than 5 years’ experience is $60 000 with a standard deviation of $5 000. Compute the z values for both Rob and Rachel.
4. A recent article reported that the mean labour cost to repair a heat pump is $90 with a standard deviation of $22. A company completed repairs on two heat pumps this morning. The labour cost for the first was $75 and it was $100 for the second. Compute z values for each and comment on your findings.
5. A normal population has a mean of 20.0 and a standard deviation of 4.0.
   a. Compute the z value associated with 25.0.
   b. What proportion of the population is between 20.0 and 25.0?
   c. What proportion of the population is less than 18.0?
6. A normal population has a mean of 12.2 and a standard deviation of 2.5.
   a. Compute the z value associated with 14.3.
   b. What proportion of the population is between 12.2 and 14.3?
   c. What proportion of the population is less than 10.0?
7. A recent study of the hourly wages of maintenance crew members for major airlines showed that the mean hourly salary was $20.50, with a standard deviation of $3.50. If we select a crew member at random, what is the probability the crew member earns:
   a. between $20.50 and $24.00 per hour?
   b. more than $24.00 per hour?
   c. less than $19.00 per hour?
8. The mean of a normal distribution is 400 pounds. The standard deviation is 10 pounds.
   a.
What is the area between 415 pounds and the mean of 400 pounds?
   b. What is the area between the mean and 395?
   c. What is the probability of selecting a value at random and discovering that it has a value of less than 395 pounds?

5.3.1 (continued) Use the standard normal curve to calculate probabilities.
The following problems are similar to the type done in the last class, only they involve a bit more work. For example, the first few examples will require some interpretation to determine the probability (area under the curve).
There are really four types of problems, one for each of the following areas under the curve.
[Sketches of the four types of area, numbered 1 to 4]
The following examples will represent each of the types.
Examples:
1. A normal population has a mean of 10.5 and a standard deviation of 2.5. What is the area under the curve between 9.5 and 10.5?
2. A normal population has a mean of 500 and a standard deviation of 25. What proportion of the data lies above 540?
3. The distribution of weekly incomes for a group of workers follows a normal distribution. The mean is $1000 and the standard deviation is $100. What is the area under this normal curve between $840 and $1200? (Note: This question could’ve been worded “What is the probability of a weekly salary being between $840 and $1200?”)
4. A normal distribution has a mean of 1000 and a standard deviation of 100. What is the area under the normal curve between 1150 and 1250?

In brief, there are four situations for finding the area under the standard normal distribution.
Type 1: To find the area between 0 and z (or –z), look up the probability directly in the table.
Type 2: To find the area beyond z (or –z), locate the probability of z in the table and subtract the probability from 0.5000.
Type 3: To find the area between two points on different sides of the mean, determine the z values and add the corresponding probabilities.
Type 4: To find the area between two points on the same side of the mean, determine the z values and subtract the smaller probability from the larger.

5.4 Applications of the Normal Curve
5.4.1 Solve normal distribution problems
There is another type of problem that you will be required to do. Notice that, up until now, you have always been asked to find the probability (or, worded differently, the area under the normal curve) and you were given the value for X. In the next type of problem, you will be given the probability (or area under the curve) and you will be required to find your X value. It is a bit like working backwards.
The steps involved:
Step 1: Since you are given the probability, you will need to draw a sketch to figure out the area you are dealing with.
Step 2: Look up the area in your chart to get your z-score.
Step 3: Fill your z-score into the equation \( z = \frac{X - \mu}{\sigma} \) and solve for your X (equivalently, \( X = \mu + z\sigma \)).
Examples:
1. A normal distribution has a mean of 80 and a standard deviation of 5. a. Determine the value above which 5% of the data occurs. b. Determine the value above which 80% of the data occurs.
2. A normal distribution has a mean of 55 and a standard deviation of 7. a. Determine the value below which 70% of the data occurs. b. Determine the value above which 10% of the data occurs.
3. The cost to use a plane follows the normal distribution with a mean of $1500 per hour and a standard deviation of $150. a. What is the operating cost for the lowest 4% of planes? b. What is the operating cost for the highest 12%? c. What is the operating cost for the lowest 60%?

Worksheet for 5.3.1
1. A normal distribution has a mean of 50 and a standard deviation of 4. a. Compute the probability of a value between 44.0 and 55.0. b. Compute the probability of a value greater than 55.0. c. Compute the probability of a value between 52.0 and 55.0.
2. A normal population has a mean of 80.0 and a standard deviation of 14.0. a.
Compute the probability of a value between 75.0 and 90.0. b. Compute the probability of a value 75.0 or less. c. Compute the probability of a value between 55.0 and 70.0.
3. A cola-dispensing machine is set to dispense on average 7.00 ounces of cola per use. The standard deviation is 0.10 ounces. The distribution of amounts dispensed follows a normal distribution. a. What is the probability that the machine will dispense between 7.10 and 7.25 ounces of cola? b. What is the probability that the machine will dispense 7.25 ounces of cola or more? c. What is the probability that the machine will dispense between 6.80 and 7.25 ounces of cola?
4. The amounts of money requested on home loan applications at a bank follow the normal distribution, with a mean of $70 000 and a standard deviation of $20 000. A loan application is received this morning. What is the probability that: a. The amount requested is $80 000 or more? b. The amount requested is between $65 000 and $80 000? c. The amount requested is $65 000 or more?
5. A normal distribution has a mean of 50 and a standard deviation of 4. Determine the value below which 95 percent of the observations will occur.
6. A normal distribution has a mean of 80 and a standard deviation of 14. Determine the value above which 80 percent of the values will occur.
7. The amounts dispensed by a cola machine follow the normal distribution with a mean of 7 ounces and a standard deviation of 0.10 ounces per cup. How much cola is dispensed in the largest 1 percent of the cups?
8. The amount requested for home loans follows the normal distribution with a mean of $70 000 and a standard deviation of $20 000. a. How much is requested on the largest 3 percent of the loans? b. How much is requested on the smallest 10 percent of the loans?
9. Assume that the mean hourly cost to operate a commercial airplane follows the normal distribution with a mean of $2100 per hour and a standard deviation of $250.
What is the operating cost for the lowest 3 percent of the airplanes?
10. The monthly sales of muffins in the Richmond, Virginia, area follow the normal distribution with a mean of 1200 and a standard deviation of 225. The manufacturer would like to establish inventory levels such that there is only a 5 percent chance of running out of stock. Where should the manufacturer set the inventory levels?

5.5 Normal Curve Approximation to the Binomial Distribution
5.5.1 Use the normal curve to find approximate solutions to binomial calculations
We have already done binomial distribution problems. Remember, you could use the formula or you could simply look up the answer in the binomial distribution chart. In this section, you will be using the normal curve to come up with an approximate answer to binomial distribution problems. So, just as in the last two sections, you will need to calculate a z-score and use your normal curve probability chart to come up with your answer. Remember from the binomial distribution section that you know how to calculate the mean AND the variance (and hence the standard deviation) using the following formulas, where n is the number of trials and \( \pi \) is the probability of success on each trial:
\[ \mu = n\pi \qquad \sigma^2 = n\pi(1 - \pi) \]
The following examples will help illustrate.
Examples:
1. Thermostats are manufactured in batches of 6, and each has a 70% chance of being acceptable (no defects). Use the normal curve approximation to estimate the probability of getting 4 or fewer acceptable thermostats.
Step 1: Calculate µ and σ² using the binomial formulas.
Step 2: Get the z-score.
2. Of a class of 14 students, each has a 50% chance of passing the test. Use the normal curve approximation to determine the probability of 5 or more students passing.

Worksheet for 5.5 and 6.0 (6.0 is the next section)
1. Of a group of 5 cars, each will have a 70% chance of having no defects. Use the normal curve approximation to estimate the probability of getting 4 or fewer cars with no defects.
2. Of a group of 6 people, each has an 80% chance of getting the flu.
Use the normal curve approximation to estimate the probability that 3 or fewer get the flu.
3. State what type of sample each of the following is. a. A sample is taken from each of the 12 provinces to represent the whole country. b. A sample is taken from Manitoba to represent the whole country. c. Every third person at the mall is surveyed. d. People are picked at random from the telephone book. e. A sample from each school in the Western district is chosen to represent the district. f. One school in the district is chosen to represent the whole district.
4. T/F a. There is more variation or dispersion in the sampling distribution of the sample means than in the population. b. When the size of the sample increases, the standard error of the mean decreases. c. The population mean is equal to the mean of the sampling distribution of the sample means.
5. The standard deviation of a population of 130 is 3.2. a. What is the standard deviation of the sampling distribution of sample means if a sample size of 5 is chosen? b. What is the standard error of the mean if a sample size of 6 is chosen?
6. The mean of a population is 28. What will be the mean of the sampling distribution of sample means if a sample size of 3 is chosen?
7. a. The standard deviation of the sampling distribution of sample means is 5. If a sample of 3 was used, what is the standard deviation of the population? b. If the mean of the sampling distribution of sample means is 37, what is the population mean? c. How would the amount of variation (i.e. dispersion) of the sampling distribution of sample means compare with the amount of variation in the population distribution?

6.0 Sampling Distributions
6.1 Introduction
6.1.1 Define Sampling
Sampling is the process of taking a part of a population.
6.1.2 Name and describe the various types of sampling.
There are four types of probability sampling commonly used; the most widely used is the simple random sample.
1.
Simple Random Sample
Several ways of selecting a simple random sample are:
a) The name or identifying number of each item in the population is recorded on a slip of paper and placed in a box. The slips of paper are shuffled and the required sample size is chosen from the box.
b) Each item is numbered and a table of random numbers, such as the one in Appendix E, is used to select the members of the sample. (Refer to the text illustration for using a table of random numbers. Note: The starting point is randomly selected.)
c) There are many software programs that will randomly select a given number of items from the population.
2. Systematic Random Sampling
A random starting point is selected, let's say 39. Then every kth item thereafter, such as every 100th, is selected for the sample. This means the numbers 39, 139, 239, 339, and so on would be part of the sample.
3. Stratified Random Sampling
For example, if our study involved Army personnel, we might decide to stratify the population into (1) generals, (2) other officers, and (3) enlisted personnel. The number selected from each of the three strata could be proportional to the total number in the population for the corresponding stratum. Each member of the population can belong to only one of the strata.
4. Cluster Sampling
Cluster sampling is often used to reduce cost when the population is scattered over a large geographic area. Suppose your objective is to study household waste collection in a very large city. The first step would be to divide the city into smaller units (perhaps blocks). The units/blocks would be numbered and several of the units/blocks would be selected randomly for inclusion in the sample. Finally, households within each of these units/blocks are randomly selected.

6.2 Selecting a Random Sample
6.2.3 Show how to use a table of random digits to obtain a random sample
Example: Suppose you want to select a random sample of 32 workers from a population of 800 employees.
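For comparison with the table method, the same selection can be sketched in software, as mentioned in option c) of the simple random sample methods. This is a minimal sketch in Python, assuming the 800 employees are numbered 1 to 800; the function name and the seed value are illustrative only and do not come from the notes.

```python
import random


def select_simple_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size distinct employee numbers from 1..population_size.

    Every employee number has the same chance of selection, and no number
    can be chosen twice (sampling without replacement).
    """
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    chosen = rng.sample(range(1, population_size + 1), sample_size)
    return sorted(chosen)


# Hypothetical run for the example: 32 workers out of 800 employees.
sample = select_simple_random_sample(800, 32, seed=1)
print(sample)
```

Because the draw is made without replacement and only from the numbers 1 to 800, there is nothing to discard, unlike the table method where codes above 800 and repeated codes are skipped.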
Step 1: Get a table of random numbers from a computer.
Step 2: Number the employees. Give each a 3-digit code so each employee has an equal chance of selection.
001: first employee
002: second employee
etc.
Since 800 is the largest possible code, discard all 3-digit codes that are bigger than 800 (801-999) and also 000.
Step 3: Pick an arbitrary starting point on the table of random numbers and begin reading the numbers. You can read the numbers in any direction you choose. In the example on the sheet provided, they went from left to right. Remember to discard any numbers bigger than 800. Also, if any number appears twice, discard the repeat.

6.4 Chance Variation among Samples
6.4.1 State the formula for sampling error.
Sampling Error: It is the difference between a sample statistic and its corresponding population parameter. For a mean, sampling error = \( \bar{X} - \mu \).
A sample of a population has a mean of 3.6, but the population mean is 3.8. What is the sampling error?

6.5 The Distribution of Sample Means
The data below represent the number of TVs sold by four employees.
Employee         Tim   Bill   Jill   Kim
# of TVs sold     2     3     10     1
a. What is the population mean?
b. Construct the sampling distribution of the sample mean for samples of size 2.
Sample of 2 employees | # of TVs sold for each employee | Mean of the sample
c. What is the mean of the sampling distribution of sample means?
d. What observations can be made regarding the mean of the population and the mean of the sampling distribution?
e. What observations can be made regarding the spread of scores in the population compared with the spread of scores in the sampling distribution?
Conclusions about the sampling distribution of the sample means
1. The mean of the sample means is exactly equal to the population mean.
2. The dispersion of the sampling distribution of the sample means is narrower than the population distribution.
**3.
The sampling distribution of the sample means tends to become bell-shaped and to approximate the normal probability distribution.
6.5.1 Show how to determine the distribution of sample means
(Insert photocopies. You won’t have to make one, just interpret it.)
NOTE: The population mean and the mean of the sample means are the same.
6.5.2 Analyze and calculate the standard error of the mean and state what the error represents
The standard error of the mean is actually the standard deviation of the sampling distribution of the sample means. If you know the standard deviation of the population, you can calculate the standard error of the mean using the following formula:
\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]
where:
\( \sigma_{\bar{X}} \) : standard error of the mean (i.e. standard deviation of the distribution of sample means)
\( \sigma \) : the standard deviation of the population
n : the number of observations in each sample
So, the standard error of the mean represents, on average, how far a sample mean is from the population mean.
There are two important things to note about the distribution of sample means:
1. The mean of the distribution of sample means will be exactly equal to the population mean if we are able to select all possible samples of the same size from a given population. That is: \( \mu_{\bar{X}} = \mu \)
2. There will be less dispersion in the sampling distribution of the sample mean than in the population. If the standard deviation of the population is \( \sigma \), the standard deviation of the distribution of sample means is \( \sigma / \sqrt{n} \). Note that when the size of the sample is increased, the standard error of the mean decreases.
Examples:
1. The standard deviation of a population of 35 marks is 5.7. What is the standard deviation of the sampling distribution of the sample means (i.e. the standard error of the mean) if a sample size of 5 is chosen?
2. The standard error of the mean (i.e. the standard deviation of the sampling distribution of the sample means) is 13 for a distribution, when a sample of 8 is used. What is the standard deviation of the population?
3.
The data below represent the number of TVs sold by four employees.
Employee         Tim   Bill   Jill   Kim
# of TVs sold     4     5      6     7
a. What is the population mean?
b. Construct the sampling distribution of the sample mean for samples of size 2. Calculate the mean of this distribution.
c. What is the mean of the sampling distribution?
d. What observations can be made regarding the mean of the population and the mean of the sampling distribution?
e. What observations can be made regarding the spread of scores in the population compared with the spread of scores in the sampling distribution?
4. The standard deviation of a population of 40 marks is 7.9. a. What is the standard deviation of the sampling distribution of the sample means (i.e. the standard error of the mean) if a sample size of 6 is chosen? b. What is the standard error of the mean if a sample size of 10 is chosen?

6.6 The Central Limit Theorem
6.6.1 State the Central Limit Theorem
The Central Limit Theorem states that for large random samples, the sampling distribution of the sample means is close to a normal probability distribution.
6.6.2 Apply the Central Limit Theorem to make predictions about and calculate probabilities for sample means.
The following steps are used to apply the central limit theorem to calculate the probability for a given sample mean.
Step 1: Calculate the z-score for the sample mean, \( \bar{X} \), using the following formula (if the POPULATION standard deviation, \( \sigma \), is known):
\[ z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]
where:
\( \bar{X} \) : is the SAMPLE mean
\( \mu \) : is the population mean
\( \sigma \) : is the population standard deviation
n : the sample size
OR (if the SAMPLE standard deviation is known):
\[ z = \frac{\bar{X} - \mu}{s / \sqrt{n}} \]
where:
\( \bar{X} \) : is the SAMPLE mean
\( \mu \) : is the population mean
s : is the sample standard deviation
n : the sample size
Step 2: Look up the probability in Appendix D and determine the desired probability using the same methods as before.
Examples:
1. A normal population has a mean of 60 and a standard deviation of 12. You select a random sample of 9.
Compute the probability that the sample mean is: a. greater than 63 b. less than 56 c. between 56 and 63
2. A population of 100 with an unknown shape has a mean of 75. You select a sample of 40. The standard deviation of the sample is 5. Compute the probability that the sample mean is: a. less than 74 b. between 74 and 76 c. between 76 and 77 d. greater than 77

6.6 The Central Limit Theorem
6.6.1 State the Central Limit Theorem
The Central Limit Theorem states that for large random samples, the sampling distribution of the sample means is close to a normal probability distribution.
6.6.2 Apply the Central Limit Theorem to make predictions about and calculate probabilities for sample means.
The following steps are used to apply the central limit theorem to calculate the probability for a given sample mean.
Step 1: Calculate the z-score for the sample mean, \( \bar{X} \), using the following formula (if the POPULATION standard deviation, \( \sigma \), is known):
\[ z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]
where:
\( \bar{X} \) : is the SAMPLE mean
\( \mu \) : is the population mean
\( \sigma \) : is the population standard deviation
n : the sample size
OR (if the SAMPLE standard deviation is known):
\[ z = \frac{\bar{X} - \mu}{s / \sqrt{n}} \]
where:
\( \bar{X} \) : is the SAMPLE mean
\( \mu \) : is the population mean
s : is the sample standard deviation
n : the sample size
Step 2: Look up the probability in Appendix D and determine the desired probability using the same methods as before.
Examples:
1. A normal population has a mean of 60 and a standard deviation of 12. You select a random sample of 9. Compute the probability that the sample mean is: a. between 60 and 63 b. greater than 63 c. less than 56 d. between 56 and 63 e. between 50 and 56
2. A population of 100 with an unknown shape has a mean of 75. You select a sample of 40. The standard deviation of the sample is 5. Compute the probability that the sample mean is: a. less than 74 b. between 74 and 77 c. between 76 and 77 d. greater than 77 e.
less than 76
Remember: For finding sample mean probabilities:
Step 1: Calculate the z-score.
Step 2: Use Appendix D (z-score chart) and interpret your answer.
Extra Examples:
1. A normal population has a mean of 60 and a standard deviation of 8. A random sample of 9 is taken. a. What is the probability that the sample mean is between 60 and 65? b. What is the probability that the sample mean is between 54 and 60? c. What is the probability that the sample mean is between 54 and 65?
2. A population of unknown shape has a mean of 70. You select a sample of 42. The standard deviation of the sample is 5. a. Compute the probability the sample mean is greater than 71. b. Compute the probability the sample mean is less than 68.8. c. Compute the probability the sample mean is greater than 68.8. d. Compute the probability the sample mean is less than 71.
3. A trucking company claims that the mean weight of their delivery trucks when they are fully loaded is 6000 pounds and the standard deviation is 250 pounds. Assume that the population follows the normal distribution. Ninety trucks are randomly selected and weighed. a. What is the probability that the sample mean is between 6020 and 6070 pounds? b. What is the probability that the sample mean is between 5970 and 5980 pounds? c. What is the probability that the sample mean is between 5970 and 6020? d. What is the probability that the sample mean is more than 6020? e. What is the probability that the sample mean is less than 6020? f. What is the probability the sample mean is more than 5970? g. What is the probability that the sample mean is less than 5970? h. What is the probability that the sample mean is between 5970 and 6000?

More Practice! Worksheet 6.6
1. The mean rent for a one-bedroom apartment in Southern California is $2,200 per month. The distribution of the monthly costs does not follow the normal distribution. In fact, it is positively skewed.
What is the probability of selecting a sample of 50 one-bedroom apartments and finding the mean to be at least $1,950 per month? The standard deviation of the sample is $250.
2. According to an IRS study, it takes an average of 330 minutes for taxpayers to prepare, copy, and electronically file a 1040 tax form. A consumer watchdog agency selects a random sample of 40 taxpayers and finds the standard deviation of the time to prepare, copy, and electronically file form 1040 is 80 minutes. a. What assumption or assumptions do you need to make about the shape of the population? b. What is the standard error of the mean in this example? c. What is the likelihood the sample mean is greater than 320 minutes? d. What is the likelihood the sample mean is between 320 and 350 minutes? e. What is the likelihood the sample mean is greater than 350 minutes?
3. Recent studies indicate that the typical 50-year-old woman spends $350 per year for personal-care products. The distribution of the amounts spent is positively skewed. We select a random sample of 40 women. The mean amount spent for those sampled is $335, and the standard deviation of the sample is $45. What is the likelihood of finding a sample mean this large or larger from the specified population?
4. Information from the American Institute of Insurance indicates the mean amount of life insurance per household in the United States is $110,000. This distribution is positively skewed. The standard deviation of the population is not known. a. A random sample of 50 households revealed a mean of $112,000 and a standard deviation of $40,000. What is the standard error of the mean? b. Suppose that you selected 50 samples of households. What is the expected shape of the distribution of the sample mean? c. What is the likelihood of selecting a sample with a mean of at least $112,000? d. What is the likelihood of selecting a sample with a mean of more than $100,000? e.
Find the likelihood of selecting a sample with a mean of more than $100,000 but less than $112,000.
5. The mean age at which men in the United States marry for the first time is 24.8 years. The shape and the standard deviation of the population are both unknown. For a random sample of 60 men, what is the likelihood that the mean age at which they were married for the first time is less than 25.1 years? Assume that the standard deviation of the sample is 2.5 years.
6. A recent study by the Greater Los Angeles Taxi Drivers Association showed that the mean fare charged for service from Hermosa Beach to the Los Angeles International Airport is $18.00 and the standard deviation is $3.50. We select a sample of 15 fares. a. What is the likelihood that the sample mean is between $17.00 and $20.00? b. What must you assume to make the above calculation?

6.3.3 (new outline addition) Apply the Central Limit Theorem to construct confidence intervals for sample means and sample proportions
Confidence Interval for a Population Mean: Normal (z) Statistic
What does a confidence interval for a population mean actually mean? When we form a 95% confidence interval, for example, for μ, we usually express our confidence interval with a statement such as, “We can be 95% confident that μ lies between the lower and upper bounds of the confidence interval.” The statement reflects our confidence in the estimation process rather than in the particular interval that is calculated from the sample data. We know that repeated application of the same procedure will result in different lower and upper bounds on the interval. Also, we know that 95% of the resulting intervals will contain μ.
Large-Sample 100(1 − α)% Confidence Interval for μ:
\[ \bar{X} \pm z_{\alpha/2}\,\sigma_{\bar{X}} = \bar{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \]
where \( z_{\alpha/2} \) is the z-value with an area α/2 to its right, \( \sigma \) is the standard deviation of the sampled population, and n is the sample size.
Note: When \( \sigma \) is unknown and n is large (say, \( n \ge 30 \)), the confidence interval is approximately equal to
\[ \bar{X} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} \]
where s is the sample standard deviation.
Examples:
1. Unoccupied seats on flights cause airlines to lose revenue. Suppose a large airline wants to estimate its average number of unoccupied seats per flight over the past year. To accomplish this, the records of 225 flights are randomly selected, and the number of unoccupied seats is noted for each of the sampled flights. The mean of the sample is 11.6 and the standard deviation of the sample is 4.1. Estimate μ, the mean number of unoccupied seats per flight during the past year, using a 90% confidence interval.
2. A random sample of n measurements was selected from a population with unknown mean μ and known standard deviation σ. Calculate a 95% confidence interval for μ for each of the following situations: a. b. c. d. (The given values for each situation are missing from the source.)

Large Sample Confidence Interval for a Population Proportion
The fact that \( \hat{p} \) is a “sample mean number of successes per trial” allows us to form confidence intervals about p in a manner that is completely analogous to that used for large-sample estimation of μ.
Large Sample Confidence Interval for p:
\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \]
where \( \hat{p} = x/n \) is the sample proportion of successes and \( \hat{q} = 1 - \hat{p} \).
Note: In order to be able to use this formula, the sample must be a random sample from the population and the sample size n must be large (usually this means that \( n\hat{p} \) and \( n\hat{q} \) are both large, for example at least 15).
Examples:
1. A polling agency conducts a survey to determine the current consumer sentiment concerning the state of the economy. Suppose that the company randomly samples 484 consumers and finds that 257 are optimistic about the state of the economy. Use a 90% confidence interval to estimate the proportion of all consumers who are optimistic about the state of the economy. Based on the confidence interval, can the company conclude that the majority of consumers are optimistic about the economy?
2. A newspaper reported that the majority of Americans say that Starbucks coffee is overpriced. A telephone survey of 1000 American adults found that 730 of them believed the coffee was overpriced.
Find and interpret a 95% confidence interval for the proportion.

6.3.4 Compute sample size for sampling data for an estimation of mean or proportion
One of the most important decisions researchers need to make when planning an experiment is the size of the sample. This section shows how to calculate an appropriate sample size for making an inference about a population mean or proportion. It depends strongly on the desired reliability of the experiment.
Sample Size Determination for a 100(1 − α)% Confidence Interval for μ
In order to estimate μ with a sampling error (SE) and with 100(1 − α)% confidence, the required sample size is found as follows:
\[ z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = SE \quad\Longrightarrow\quad n = \frac{(z_{\alpha/2})^2\,\sigma^2}{(SE)^2} \]
Note: The value of \( \sigma \) is usually unknown. It can be estimated using either the sample standard deviation, s, if it is known, OR \( \sigma \approx R/4 \), where R is the range of observations in the population. Also, it is good practice to round the value of n upward to ensure that the sample size will be sufficient to achieve the specified reliability.
Examples:
1. The manufacturer of official NFL footballs uses a machine to inflate its new balls to a pressure of 13.5 pounds. When the machine is properly calibrated, the mean inflation pressure is 13.5 pounds, but uncontrollable factors cause the pressures of individual footballs to vary randomly from about 13.3 to 13.7 pounds. For quality-control purposes, the manufacturer wishes to estimate the mean inflation pressure to within 0.025 pound of its true value with a 99% confidence interval. What sample size should be used?
2. Suppose you wish to estimate a population mean correct to within 0.20 with a probability equal to 0.90. You do not know σ, but you know that the observations will range in value between 30 and 34. Find the approximate sample size required.
3. A study wishes to determine the mean bending strength of imported white wood used on the roof of an ancient Japanese temple. The researchers would like to estimate the true mean breaking strength of the wood to within 4 MPa using a 90% confidence interval.
How many pieces of the wood need to be tested if the sample standard deviation of the breaking strengths from the study was 10.9 MPa?
The procedure for finding the sample size necessary to estimate a population proportion with a specified sampling error, SE, is as follows:
Sample Size Determination for a 100(1 − α)% Confidence Interval for p
In order to estimate a binomial probability p with sampling error SE and with 100(1 − α)% confidence, the required sample size is found by solving the following equation for n:
\[ z_{\alpha/2}\sqrt{\frac{pq}{n}} = SE \quad\Longrightarrow\quad n = \frac{(z_{\alpha/2})^2\,pq}{(SE)^2} \]
where \( q = 1 - p \).
Note: Because the value of the product pq is unknown, it can be estimated by using the sample fraction of successes, \( \hat{p} \), from a prior sample. Remember that the value of pq is at its maximum when p equals 0.5, so you can obtain conservatively large values of n by approximating p by 0.5 or values close to 0.5. In any case, you should round the value of n obtained upward to ensure that the sample size will be sufficient to achieve the specified reliability. So, if you don’t know p, let it equal 0.5!
Examples:
1. In each case, find the approximate sample size required to construct a 95% confidence interval for p that has a sampling error of 0.08. a. Assume p is near 0.2. b. Assume you have no prior knowledge about p, but you wish to be certain that your sample is large enough to achieve the specified accuracy for the estimate.
2. A warehouse stores approximately 60 million empty aluminum beer and soda cans. Recently, a fire occurred at the warehouse. The smoke from the fire contaminated many of the cans with blackspot, making them unusable. A statistician was hired to estimate p, the true proportion of cans that were contaminated by the fire. How many aluminum cans should be randomly sampled to estimate p to within 0.02 with 90% confidence?
3. A survey was conducted to determine what “Made in Canada” means to consumers. 64 of 106 shoppers at a mall believe that it implies that all labor and materials are produced in Canada.
Suppose the researchers want to increase the sample size in order to estimate the true proportion to within 0.05 of its true value using a 90% confidence interval. Compute the sample size necessary to obtain the desired estimate.

Test #2 up to here!!!

7.0 Hypothesis Testing
7.1 Introduction
7.1.1 Describe the purpose of Hypothesis Testing
What is Hypothesis Testing?
Hypothesis testing is a statistical procedure which involves a decision-making process for evaluating claims about a certain parameter of a population. As a researcher of data, you may be interested in answering many types of questions. Automobile manufacturers may be interested in determining whether seat belts will reduce the severity of injuries caused by accidents. A ladies' wear store may want to know whether the general public prefers a certain colour in a new line of fashion swim wear. These types of questions can be answered using the methods of hypothesis testing. Hypothesis testing starts with a statement about a population parameter such as the mean.
What is a Hypothesis?
In statistical analysis we make a claim, that is, state a hypothesis, then follow up with tests to verify the assertion or to determine that it is untrue. Because we utilize statistical inference, it is not necessary to measure the entire population; instead, we take a sample from the population to determine whether the empirical evidence from the sample does or does not support the statement concerning the population. As noted, hypothesis testing starts with a statement about a population parameter such as the mean.
Example: One statement about the performance of a new model car is that the mean miles per gallon is 30. Another statement is that the mean miles per gallon is not 30. Only one of these statements is correct.
To test the validity of the assumption (hypothesis) that the mean miles per gallon is 30, we must select a sample from the population, calculate sample statistics, and based on certain decision rules either accept or reject the hypothesis.

7.2 Hypothesis Tests
7.2.1 Name and describe the components of a statistical hypothesis test

Five-Step Procedure for Hypothesis Testing
When conducting hypothesis tests we actually employ a strategy of "proof by contradiction." We hope to accept a statement to be true by rejecting or ruling out another statement. Statistical hypothesis testing is a five-step procedure:

Hypothesis Testing - Step 1
The first step is to state the null and alternate hypotheses.

What is the null hypothesis?
For example, a recent newspaper report made the claim that the mean length of a hospital stay was 3.3 days. You think that the true length of stay is some other length than 3.3 days. The null hypothesis is written H0: µ = 3.3. It is the statement about the value of the population parameter, in this case the population mean. The null hypothesis is established for the purpose of testing. On the basis of the sample evidence, it is either rejected or not rejected. In other words, it is accepted or rejected. If the null hypothesis is rejected, then we accept the alternate hypothesis. The alternate hypothesis is written H1: µ ≠ 3.3.

There are two other formats for writing the null and alternate hypotheses. Suppose you think that the mean length of stay is greater than 3.3 days. The null and alternate hypotheses would be written: H0: µ ≤ 3.3, H1: µ > 3.3. Note that in this case the null hypothesis indicates "no change or that µ is less than 3.3." The alternate hypothesis states that the mean length of stay is greater than 3.3 days. Suppose you think that the mean length of stay is less than 3.3 days.
The null and alternate hypotheses would be written: H0: µ ≥ 3.3, H1: µ < 3.3. It is important to remember that no matter how the problem is stated, the null hypothesis will always contain the equal sign. The equality sign will never appear in the alternate hypothesis.

One-tailed versus two-tailed test
When a direction is expressed in the alternate hypothesis, such as > or <, the test is referred to as being a one-tailed test. When the alternate hypothesis is that of "≠" (not equal to), the test will be a two-tailed test.

Hypothesis Testing - Step 2
After setting up the null hypothesis and alternate hypothesis, the next step is to state the level of significance. The level of significance is designated α, the Greek letter alpha. It will indicate when the sample mean is too far away from the hypothesized mean for the null hypothesis to be true. When a true null hypothesis is rejected, it is referred to as a Type I error. If the null hypothesis is not true, but our sample results indicate that it is, we have a Type II error.

Hypothesis Testing - Step 3
Step 3 of the hypothesis testing procedure is to compute the test statistic.

What is a test statistic? Which test statistic do I use?
The answer to this question is determined by factors such as whether the population standard deviation is known and the size of the sample. The standard normal distribution, the z value, is used if the population is normally distributed, if the population standard deviation is known, and when the sample size is greater than 30.

Hypothesis Testing - Step 4
Formulate the Decision Rule: A decision rule is based on H0 and H1, the level of significance, and the test statistic. The decision rule is formulated by finding the critical values for z. If we are applying a one-tailed test, there is only one critical value. If we are applying a two-tailed test, there are two critical values.
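Steps 3 and 4 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `z_statistic` and `z_critical` and the sample numbers are hypothetical, the test statistic is assumed to be the usual one-sample z = (sample mean − hypothesized mean) / (σ/√n), and the critical value is read from the standard normal distribution in Python's `statistics` module rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, hypothesized_mean, sigma, n):
    """One-sample z test statistic (assumes sigma is known or n > 30)."""
    return (sample_mean - hypothesized_mean) / (sigma / sqrt(n))


def z_critical(alpha, two_tailed=True):
    """Positive critical z value for significance level alpha."""
    # A two-tailed test splits alpha evenly between the two tails.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


# Hypothetical data, not taken from the notes: H0: mu = 50, H1: mu != 50.
z = z_statistic(sample_mean=52.0, hypothesized_mean=50.0, sigma=6.0, n=36)
crit = z_critical(alpha=0.05, two_tailed=True)

print(round(z, 2), round(crit, 2))
# Decision rule for a two-tailed test: reject H0 if z falls outside (-crit, +crit).
print("reject H0" if abs(z) > crit else "do not reject H0")
```

For a one-tailed test, pass `two_tailed=False` and compare z with `+crit` (upper tail) or `-crit` (lower tail) instead of using the absolute value.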
The following diagram illustrates the critical values for a two-tailed test, at the 0.01 level of significance. Since this is a two-tailed test, half of the 0.01, or 0.005, is found in each tail. The area where H0 is not rejected is therefore 0.99. Since Appendix D is based on half of the area under the curve, we locate 0.99/2 = 0.4950 in the body of the table to find the corresponding critical z values = ±2.58. Therefore, our decision rule is: Reject the null hypothesis and accept the alternate hypothesis if the computed value of z does not fall in the region between -2.58 and +2.58.

To find the critical value for a one-tailed test, at the 0.01 level of significance, place the 0.01 of the total area in the upper or lower tail. This means that 0.5000 - 0.01 = 0.4900 of the area is located between the z value of 0 and the critical value. We locate 0.4900 in the body of Appendix D and our decision rule is to reject the null hypothesis if the computed value from the test statistic exceeds 2.33 for an upper-tailed test or is less than -2.33 for a lower-tailed test. The following diagrams illustrate the acceptance and rejection areas for an upper-tailed test.

Hypothesis Testing - Step 5
Select the Sample and Make a Decision: The final step is to select the sample and compute the value of the test statistic. This value is compared to the critical value, or values, and a decision is made whether to reject or to accept the null hypothesis.

In the following example the critical values for z are -2.58 and +2.58 (a two-tailed test). The computed value of z = 1.55. Since the computed value falls in the acceptance range, we do not reject the null hypothesis; we accept it.

7.0 Hypothesis test examples:
1. A company manufactures desks. Their production follows the normal distribution, with a mean of 200 per week and a standard deviation of 16. The president would like to investigate whether the mean number of desks is different from 200 at the 0.01 significance level.
A sample accumulated over 150 weeks has a mean of 203.5. Is the president right in assuming that the mean number of desks is different from 200?
2. The rate at which a stock of aspirin changes each year (its turnover) has a mean of 6.0 and a standard deviation of 0.50. A random sample of 64 aspirin revealed a mean of 5.84. It is suspected that the mean turnover has changed and is no longer 6.0. Use the 0.05 significance level to test the hypothesis that the mean turnover is not 6.0.
3. The mean age of passenger cars in the US is 8.4 years. A sample of 40 cars in the student lots at the University of Tennessee showed the mean age to be 9.2 years. The standard deviation of this sample was 2.8 years. At the 0.1 significance level, can we conclude the mean age is more than 8.4 years for the cars of Tennessee students?
4. The manager of a store wants to find whether the mean unpaid balance is more than $400. The level of significance is set at 0.05. A random sample of 60 unpaid balances revealed the sample mean is $407 and the standard deviation is $22.50. Should she conclude that the mean is greater than $400?
5. The mean amount of time spent watching TV per day for eighth graders is 1.6 hours. A sample of 35 eighth graders showed the mean number of hours to be 1.3 hours with a standard deviation of 1.0 hours. At the 0.01 significance level, can we conclude that the mean time is less than 1.6 hours?
6. The mean number of minutes spent on the phone by employees is said to be 37 with a standard deviation of 2.1. The owner of a company wants to determine whether the mean number of minutes is less than 37. She takes a sample of 43 employees and finds that the mean amount of time spent is 33. Can we conclude that the mean number of minutes is less than 37? (Use the 0.05 significance level.)
7. A town council claims that the mean time citizens spend commuting to work is 28 minutes. A company believes that the mean is not 28 minutes and takes a sample of 50 citizens.
They determine that the mean commuting time of the sample is 36 minutes with a standard deviation of 11 minutes. At the 0.01 significance level, can the company conclude that the mean commuting time for the town is different from 28 minutes?

Worksheet for 7.0
1. The following information is available. H0: µ = 50, H1: µ ≠ 50. The sample mean is 49, and the sample size is 36. The population follows the normal distribution and the standard deviation is 5. Use the .05 significance level.
2. The following information is available. H0: µ ≤ 10, H1: µ > 10. The sample mean is 12 for a sample of 36. The population follows the normal distribution and the standard deviation is 3. Use the .02 significance level.
3. A sample of 36 observations is selected from a normal population. The sample mean is 21, and the sample standard deviation is 5. Conduct the following test of hypothesis using the .05 significance level. H0: µ ≤ 20, H1: µ > 20
4. A sample of 64 observations is selected from a normal population. The sample mean is 215, and the sample standard deviation is 15. Conduct the following test of hypothesis using the .03 significance level. H0: µ ≥ 220, H1: µ < 220

For Exercises 5-8: (a) State the null hypothesis and the alternate hypothesis. (b) State the decision rule. (c) Compute the value of the test statistic. (d) What is your decision regarding H0? (e) What is the p-value? Interpret it.
5. The manufacturer of the X-15 steel-belted radial truck tire claims that the mean mileage the tire can be driven before the tread wears out is 60,000 miles. The Crosset Truck Company bought 48 tires and found that the mean mileage for their trucks is 59,500 miles with a standard deviation of 5,000 miles. Is Crosset's experience different from that claimed by the manufacturer at the .05 significance level?
6. The MacBurger restaurant chain claims that the waiting time of customers for service is normally distributed, with a mean of 3 minutes and a standard deviation of 1 minute.
The quality-assurance department found in a sample of 50 customers at the Warren Road MacBurger that the mean waiting time was 2.75 minutes. At the .05 significance level, can we conclude that the mean waiting time is less than 3 minutes?
7. A recent national survey found that high school students watched an average (mean) of 6.8 DVDs per month. A random sample of 36 college students revealed that the mean number of DVDs watched last month was 6.2, with a standard deviation of 0.5. At the .05 significance level, can we conclude that college students watch fewer DVDs a month than high school students?
8. At the time she was hired as a server at the Grumney Family Restaurant, Beth Brigden was told, "You can average more than $80 a day in tips." Over the first 35 days she was employed at the restaurant, the mean daily amount of her tips was $84.85, with a standard deviation of $11.38. At the .01 significance level, can Ms. Brigden conclude that she is earning an average of more than $80 in tips?

7.5 Tests Concerning Means for Small Samples and Tests Concerning Proportions
This is the method for doing hypothesis tests if you are given s and the sample size, n, is LESS than 30.

t distribution:
So far we have been talking about the standard normal distribution with the sample chosen being greater than 30, but what happens if the sample is 30 or less? If we have a population that is already normally distributed and the population standard deviation, σ, is unknown, then we can pick a sample smaller than 30 and use the t distribution.
1. The degrees of freedom of a sample are given by: d.f. = n - 1
2. The t score is given by: $t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$

Characteristics of t Distribution
The Student's t distribution is also referred to as the t distribution. It is similar to the standard normal distribution in some ways, but quite different in others.
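The similarities and differences can also be checked numerically. The sketch below is illustrative only: the helper `t_pdf` is a hypothetical name, and it assumes the standard density formula of the Student's t distribution, which is not stated in these notes. It prints the height of each curve at the centre (0) and in the tail (3) for a few degrees of freedom, next to the standard normal values.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist


def t_pdf(x, df):
    """Density of the Student's t distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Compare the curves at the centre (x = 0) and in the tail (x = 3).
for df in (5, 15, 30):
    print("t, d.f. =", df, round(t_pdf(0, df), 4), round(t_pdf(3, df), 4))
print("z           ", round(z.pdf(0), 4), round(z.pdf(3), 4))
```

With few degrees of freedom the t curve is lower at the centre and higher in the tails than the z curve, and the gap shrinks as the degrees of freedom grow.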
From the following chart comparing the two distributions, you will note that the t distribution is flatter and more spread out than the z distribution.

Note that when t is used as the test statistic instead of the standard normal z distribution:
- The region for which the null hypothesis cannot be rejected is wider.
- A larger value of t will be required to reject the null hypothesis.
A further requirement is that the population from which the sample is obtained should be normal or approximately normal.

Hypothesis testing using t-tests:
We have seen how we can use the z-score test statistic for hypothesis testing, but if the sample that is chosen from the population is 30 or less then we must use the t test statistic.
Note: To use the t statistic we must:
a) have n ≤ 30
b) have a population that is essentially normal
c) have σ unknown

Examples:
1. A manufacturer of computer disk drives monitors the retail prices of its drives in order to gauge the market. For one type of drive the average price is $750, and the manufacturer wishes to know whether the current mean retail price differs from $750. Seventeen retail establishments are sampled, and their current prices for the drive are determined. The mean of the 17 retail prices is $732 and the standard deviation is $38. Can we conclude that the mean price differs from the list price of $750? (Use 0.05 for the level of significance.)
2. In any bottling process, a manufacturer will lose money if the bottles contain more or less than is claimed on the label. Suppose a quality manager for a mustard company is interested in testing whether the mean number of ounces of mustard per family-size bottle differs from the labeled amount of 20 ounces. The manager samples nine bottles, measures the weight of their contents, and finds the sample to have a mean of 19.7 ounces and a standard deviation of 0.3 ounces. Does the sample evidence indicate that the mustard dispensing machine needs adjusting?
Test at the 0.05 significance level.
3. An insurance company reports that the mean cost to process a claim is $60. Cost-cutting measures were introduced. To evaluate whether the measures worked, a sample of 26 claims was taken and found to have a mean of $56.42 with a standard deviation of $10.04. At the 0.01 significance level, is it reasonable to conclude that the mean cost to process a claim is now less than $60?
4. The mean life of a clock battery is 305 days. The battery was changed to make it last longer. A sample of 20 of the changed batteries had a mean life of 311 days with a standard deviation of 12 days. Is it reasonable to conclude that the change increased the mean life of the battery? Use the 0.05 significance level.
5. The mean length of a bar is 43 millimetres. The production supervisor is concerned that the adjustments of the machine producing the bars have changed. To investigate, he takes a sample of 12 bars and they are found to have a mean of 41.5 millimetres with a standard deviation of 43.0 millimetres. Is it fair to conclude that there has been a change in the mean length of the bars? Use the 0.02 significance level.

Tests Concerning Proportions
What is a proportion?
For example, we want to estimate the proportion of all home sales made to first-time buyers. A random sample of 200 recent transactions showed that 40 were first-time buyers. Therefore, we estimate that 0.20, or 20% (40/200), of all sales are made to first-time buyers.

To conduct a test of hypothesis for proportions, the same assumptions required for the binomial distribution must be met. Each outcome is classified into one of two categories; for example, buyers were either first-time home buyers or they were not. The number of trials is fixed. In the above case it is 200. Each trial is independent, meaning that the outcome of one trial has no bearing on the outcome of any other. The probability of a success is fixed.
In the above example, the probability is 0.20 for all 200 buyers.

Again, the five-step procedure for hypothesis testing is used. The test statistic for testing hypotheses about proportions is the standard normal distribution. The computed z is found by:

$$z = \frac{\hat{p} - p}{\sqrt{\dfrac{p(1-p)}{n}}}$$

where $\hat{p}$ is the sample proportion, $p$ is the hypothesized population proportion, and $n$ is the sample size.

Examples:
1. A report shows that 40 percent of people involved in minor traffic accidents this year have been involved in at least one other traffic accident in the last five years. A group decided to investigate this claim, believing it was too large. A sample of 200 traffic accidents this year showed 74 people were also involved in another accident within the last five years. Use the 0.01 significance level to see if you can conclude that a smaller percentage of people than 40 percent are involved in more than one accident.
2. Prior elections in Indiana indicate it is necessary for a candidate for governor to receive at least 80 percent of the vote in the northern section of the state to be elected. The incumbent governor is interested in assessing his chances of returning to office and plans to conduct a survey of 2000 voters. He finds that of the 2000 sampled, 1550 plan on voting for him. Can he conclude that the percentage of votes needed to win is different from 80%? (Use the 0.05 significance level.)
3. An urban planner claims that, nationally, 20 percent of all families renting condos move during a given year. A random sample of 200 families renting condos in Dallas revealed that 56 had moved during the past year. At the 0.01 significance level, can we conclude that a larger proportion of condo renters moved in the Dallas area?
4. A TV manufacturer found that 10 percent of its sets needed repair in the first 2 years of operation. In a sample of 50 sets manufactured 2 years ago, 9 needed repair. At the 0.05 significance level, can we conclude that the percent needing repair is larger than the 10 percent cited by the manufacturer?

Worksheet for 7.5 (t-test for one sample)
1.
A company claims that the mean number of cars sold per day is 10. A random sample of 10 days is chosen, and the sample mean was found to be 12 cars a day and the sample standard deviation was 3. Use the 0.05 significance level to determine whether we can conclude that the mean number of cars sold is greater than 10.
2. A government agency claims that 400 people have traffic accidents every day. A sample of 12 days is taken and the mean is found to be 407 and the sample standard deviation is 6. Use the 0.01 significance level to determine whether we can conclude that the mean is different from 400.
3. A district sales manager claims that the sales representatives make an average of 40 sales calls per week on professors. Several reps say that this estimate is too low. To investigate, a random sample of 28 sales representatives reveals that the mean number of calls made last week was 42. The standard deviation of the sample is 2.1 calls. Using the 0.05 significance level, can we conclude that the mean number of calls per salesperson per week is more than 40?
4. The management of White Industries is considering a new method of assembling its golf cart. The present method requires 42.3 minutes, on the average, to assemble a cart. The mean assembly time for a random sample of 24 carts, using the new method, was 40.6 minutes, and the standard deviation of the sample was 2.7 minutes. Using the 0.10 significance level, can we conclude that the assembly time using the new method is faster?
5. A spark plug manufacturer claimed that its plugs have a mean life that is different from 22100 miles. Assume that the life of the spark plugs follows the normal distribution. A fleet owner purchased a large number of sets. A sample of 18 sets revealed that the mean life was 23400 miles and the standard deviation was 1500 miles. Is there enough evidence to substantiate the manufacturer's claim at the 0.05 significance level?

Worksheet for 7.5 (proportion test for one sample)
1.
A company claims that 70% of pet owners in a town have dogs. A sample of 100 observations revealed that 75 pet owners have dogs. At the 0.05 significance level, can we conclude that the percentage of pet owners who have dogs is bigger than 70%?
2. A newspaper reported that 40% of people taking a certain drug exhibit side effects. A sample of 120 observations revealed that 36 people exhibited side effects. At the 0.05 significance level, can we conclude that the percentage exhibiting side effects is different from 40%?
3. The National Safety Council reported that 52 percent of American turnpike drivers are men. A sample of 300 cars travelling southbound on the New Jersey Turnpike yesterday revealed that 170 were driven by men. At the 0.01 significance level, can we conclude that a larger proportion of men were driving on the New Jersey Turnpike than the national statistics indicate?
4. A recent article in USA Today reported that a job awaits only one in three new college graduates. The major reasons given were an overabundance of college graduates and a weak economy. A survey of 200 recent graduates from your school revealed that 80 students had jobs. At the 0.02 significance level, can we conclude that a larger proportion of students at your school have jobs?
5. Chicken Delight claims that 90 percent of its orders are delivered within 10 minutes of the time the order is placed. A sample of 100 orders revealed that 82 were delivered within the promised time. At the 0.10 significance level, can we conclude that less than 90 percent of the orders are delivered in less than 10 minutes?
6. Research at the University of Toledo indicates that 50 percent of the students change their major area of study after their first year in a program. A random sample of 100 students in the College of Business revealed that 48 had changed their major area of study after their first year of the program.
Has there been a significant change in the proportion of students who change their major after the first year in their program? Test at the 0.05 level of significance.

7.6 Tests Concerning Differences between the Means of Large Samples and Tests Concerning the Differences between Proportions

Hypothesis Testing: Two Population Means
If there are two populations, we can compare two sample means to determine if they came from populations with the same or equal means. For example, a purchasing agent is considering two brands of tires for use on the company's fleet of cars. A sample of 60 Rossford tires indicates the mean useful life to be 45,000 miles. A sample of 50 Maumee tires revealed the useful life to be 48,000 miles. Could the difference between the two sample means be due to chance?

The assumption is that for both populations, the standard deviations are either known or have been computed from samples whose sizes are greater than 30. The test statistic used is the standard normal distribution and its value is computed as:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

Examples:
1. Customers at a market have a choice when paying for their groceries. They may check out and pay using the standard cashier check-out, or they can use the new U-Scan procedure where they do it on their own. The store manager would like to know if the mean check-out time using the standard method is longer than using the U-Scan. She gathered the following information.

Method used    Sample mean     Sample standard deviation    Sample size
Standard       5.50 minutes    0.40 minutes                 50
U-Scan         5.30 minutes    0.30 minutes                 100

Can she conclude that the standard method takes longer than the U-Scan method? Use the 0.01 significance level.
2. The owner of a company noticed a difference in the dollar value of the sales between the men and the women who work for her. A sample of 40 days showed that the men sold a mean of $1400 worth of goods with a standard deviation of $200.
For a sample of 50 days, the women sold a mean of $1500 worth of appliances per day with a standard deviation of $250. At the 0.05 significance level, can she conclude that the mean amount sold per day is larger for the women?
3. Karen would like to determine whether there are more units produced on the afternoon shift than on the day shift. A sample of 54 day-shift workers showed that the mean number of units produced was 345, with a standard deviation of 21. A sample of 60 afternoon-shift workers showed that the mean number of units produced was 351, with a standard deviation of 28 units. At the 0.05 significance level, is the mean number of units produced on the day shift smaller?
4. A company would like to determine whether the average mark of version one of a test is different from the average mark of version two of a test. A sample of 48 people taking test one had a mean mark of 72 percent and a standard deviation of 12. A sample of 56 people taking test two had a mean mark of 65 and a standard deviation of 15. At the 0.01 significance level, can we conclude that the average mark of test one is different from the average mark of test two?

Two Population Proportions
Often we are interested in whether two population proportions are the same. For example, we want to compare the proportion of rural voters planning to vote for the incumbent governor with the proportion of urban voters. In order to conduct this test, we assume each sample is large enough so that the normal distribution may be used as a good approximation of the binomial. Again, the difference lies in the formula for finding the computed z-value. In this formula we use the "pooled estimate" of the population proportion:

$$p_c = \frac{X_1 + X_2}{n_1 + n_2}, \qquad z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{p_c(1-p_c)}{n_1} + \dfrac{p_c(1-p_c)}{n_2}}}$$

where $X_1$ and $X_2$ are the numbers of successes in the two samples, $n_1$ and $n_2$ are the sample sizes, and $\hat{p}_1 = X_1/n_1$ and $\hat{p}_2 = X_2/n_2$ are the sample proportions.

Examples:
1. A department store is interested in whether there is a difference in the proportions of younger and older women who would purchase a type of perfume if it were marketed.
There are two independent populations, a population consisting of the younger women and a population consisting of the older women. Each sampled woman will be asked to smell the perfume and state whether she would buy it. A random sample of 100 young women revealed 20 liked it enough to buy it. Similarly, a sample of 200 older women revealed 100 liked it enough to buy it. Using the 0.05 significance level, can we conclude that there is a difference in the proportion of younger women and the proportion of older women in whether they would buy the perfume?
2. Of 150 adults who tried a new peach-flavoured peppermint patty, 87 rated it excellent. Of 200 children sampled, 123 rated it excellent. Using the 0.10 level of significance, can we conclude that there is a significant difference in the proportion of adults and the proportion of children who rate the new flavour excellent?
3. The manufacturer of Advil developed a new version and claimed it to be more effective. To evaluate it, a sample of 200 current users is asked to try it. After a one-month trial, 180 indicated the new drug was more effective. At the same time, a sample of 300 current users is given the old drug but told it is the new formulation (i.e. they were given a placebo). From this group, 261 said it was an improvement. At the 0.05 significance level, can we conclude that the new drug is more effective?
4. A company wishes to determine whether a higher proportion of women than men prefer cats to dogs as pets. A sample of 320 men showed that 125 preferred cats to dogs and a sample of 250 women showed that 105 preferred cats to dogs. At the 0.02 significance level, can we conclude that a higher proportion of women than men prefer cats to dogs as pets?

Worksheet #1 for 7.6 (z-test for 2 samples)
1. A sample of 40 observations is selected from one population. The sample mean is 102 and the sample standard deviation is 5. A sample of 50 observations is selected from a second population.
The sample mean is 99 and the sample standard deviation is 6. Can we conclude that the mean of the first population is different from the mean of the second population using the 0.04 significance level?
2. A sample of 65 observations is selected from one population. The sample mean is 2.67 and the sample standard deviation is 0.75. A sample of 50 observations is selected from a second population. The sample mean is 2.59 and the sample standard deviation is 0.66. Can we conclude that the mean of the first population is bigger than the mean of the second population at the 0.08 significance level?
3. The Gibbs Baby Food Company wishes to compare the weight gain of infants using their brand versus their competitor's. A sample of 40 babies using Gibbs products revealed a mean weight gain of 7.6 pounds in the first three months after birth. The standard deviation of the sample was 2.3 pounds. A sample of 55 babies using the competitor's brand revealed a mean increase in weight of 8.1 pounds, with a standard deviation of 2.9 pounds. At the 0.05 significance level, can we conclude that babies using Gibbs brand gained less weight?
4. As part of a study of corporate employees, the Director of Human Resources for PNC, Inc. wants to compare the distance traveled to work by employees at their office in downtown Cincinnati with the distance for those in downtown Pittsburgh. A sample of 35 Cincinnati employees showed they travel a mean of 370 miles per month, with a standard deviation of 30 miles per month. A sample of 40 Pittsburgh employees showed they travel a mean of 380 miles per month, with a standard deviation of 26 miles per month. At the .05 significance level, is there a difference in the mean number of miles traveled per month between Cincinnati and Pittsburgh employees? Use the five-step hypothesis-testing procedure.

Worksheet #2 for 7.6 (2 sample proportion test)
1.
A company wishes to determine whether a larger proportion of people are married in population 1 than in population 2. A sample of 100 observations from population 1 indicated that 70 were married. A sample of 150 observations from the second population found that 80 were married. Can we conclude at the 0.05 significance level that the proportion of people married in population 1 is bigger than the proportion married in population 2?
2. The Damon family owns a large grape vineyard in western New York along Lake Erie. The grapevines must be sprayed at the beginning of the growing season to protect against various insects and diseases. Two new insecticides have just been marketed: Pernod 5 and Action. To test their effectiveness, three long rows were selected and sprayed with Pernod 5, and three others were sprayed with Action. When the grapes ripened, 400 of the vines treated with Pernod 5 were checked for infestation. Likewise, a sample of 400 vines sprayed with Action were checked. The results are:

Insecticide    Number of Vines Checked (sample size)    Number of Infested Vines
Pernod 5       400                                      24
Action         400                                      40

3. A nationwide sample of influential Republicans and Democrats was asked as a part of a comprehensive survey whether they favoured lowering environmental standards so that high-sulfur coal could be burned in coal-fired power plants. The results were:

                    Republicans    Democrats
Number sampled      1,000          800
Number in favour    200            168

Can we conclude that the proportion of Republicans who favour lowering standards is less than the proportion of Democrats? Use the 0.02 significance level.

7.7 Tests Concerning Differences between Means for Small Samples

TWO SAMPLE TEST OF MEANS
To conduct a test of hypothesis for the difference between two population means, we must make three assumptions:
1. The populations must be normally distributed, or approximately normally distributed.
2. The populations must be independent.
3.
The population variances must be equal.

The t statistic is similar to that used for a z statistic for the difference between two population means. However, we must make one additional calculation. The two sample variances must be "pooled" to form a single estimate of the unknown population variance. The value of t is then computed.

Pooled Sample Variance: $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

Test Statistic: $t = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$

Note: The degrees of freedom for a two-sample test are found by (n1 + n2 - 2). So, the above formula is used when you are given two samples and both are less than 30 and you are given each sample mean and each sample standard deviation.

Examples:
1. There are two different ways to mount an engine on a lawnmower. A sample of 5 mowers was used to test the first procedure and it was found to have a mean time of 4 minutes with a standard deviation of 2.92 minutes. A sample of 6 mowers was used to test the second method and it was found to have a mean time of 5 minutes and a standard deviation of 2.10. Can we conclude that there is a difference between the means of the two methods? (Use the 0.10 significance level.)
2. A manager wants to compare the number of defective wheelchairs produced on the day shift with the number on the afternoon shift. A sample of 6 day shifts was found to have a mean of 7 and a standard deviation of 1.41. A sample of 8 afternoon shifts was found to have a mean of 10 and a standard deviation of 2.27. At the 0.05 significance level, can we conclude that the afternoon shift makes more defective wheelchairs on average?
3. A budget director would like to compare the travel expenses for the sales staff and the audit staff. A sample of 6 sales staff is found to have a mean of 142.5 and a standard deviation of 12.2. A sample of 7 audit staff is found to have a mean of 130.3 and a standard deviation of 15.8. At the 0.10 significance level, can she conclude that the mean daily expenses for the audit staff are less than the sales staff?
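As a rough check on the arithmetic, the pooled-variance calculation can be sketched in Python using the summary numbers from the lawnmower example (Example 1 above). The function name `pooled_t` is hypothetical, and the sketch assumes the usual pooled formulas: the pooled variance weights each sample variance by its degrees of freedom, and t divides the difference in sample means by the square root of the pooled variance times (1/n1 + 1/n2). The critical value is not computed here; it would still come from a t table with n1 + n2 - 2 degrees of freedom.

```python
from math import sqrt


def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Return (pooled variance, t statistic, degrees of freedom)."""
    df = n1 + n2 - 2
    # Weight each sample variance by its own degrees of freedom.
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    t = (mean1 - mean2) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return sp2, t, df


# Lawnmower example: method 1 (n = 5) versus method 2 (n = 6).
sp2, t, df = pooled_t(mean1=4, s1=2.92, n1=5, mean2=5, s2=2.10, n2=6)
print(round(sp2, 3), round(t, 3), df)

# Two-tailed decision at the 0.10 level: compare |t| with the table value
# for 9 degrees of freedom (1.833 in a standard t table).
t_table = 1.833
print("reject H0" if abs(t) > t_table else "do not reject H0")
```

The same function can be reused for the other examples by swapping in their means, standard deviations, and sample sizes; for a one-tailed test, compare t with the signed table value rather than using the absolute value.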
8.0 Analysis of Variance

8.2.3 Discuss the F-distribution for comparing variances.

The F Distribution
The test statistic used to compare the sample variances and to conduct the ANOVA test is the F distribution.

F Distribution: A continuous probability distribution where F is always 0 or positive. The distribution is positively skewed. It is based on two parameters, the number of degrees of freedom in the numerator and the number of degrees of freedom in the denominator.

The major characteristics of the F distribution are:

The shape of the distribution is illustrated in the following graph. Note that the shape of the curves changes as the degrees of freedom change.

(omit this F test)
8.1.1 Describe the method of analysis of variance.

Comparing Two Population Variances
The F distribution is used to test the hypothesis that the variance of one normal population equals the variance of another normal population. The F distribution can also be used to validate assumptions with respect to certain statistical tests.

REGARDLESS OF WHICH TEST STATISTIC WE ARE USING, WE STILL USE THE USUAL FIVE-STEP HYPOTHESIS TESTING PROCEDURE.

For a two-tailed test, the critical value is found by halving the significance level. (Example: At a significance level of 0.10, we find the critical value by looking up 0.10/2 = 0.05.) We use n1 - 1 degrees of freedom for the numerator, and n2 - 1 degrees of freedom for the denominator.

For one-tailed:

Examples:
1. Colin, a stockbroker, reports the mean rate of return on a sample of 10 software stocks is 12.6 percent with a standard deviation of 3.9 percent. The mean rate of return on a sample of 8 utility stocks is 10.9 percent with a standard deviation of 3.5 percent. At the 0.05 significance level, can Colin conclude that there is more variation in the software stocks?

2. A random sample of seven observations from the first population resulted in a standard deviation of 7.
A random sample of five observations from the second population showed a standard deviation of 12. At the 0.1 significance level, is there a difference in the variation of the two samples?

3. The standard deviation of the marks for a sample of five females in a class is 6.2 and the standard deviation for a sample of eight males is 4.3. At the 0.05 significance level, can we conclude that the male scores have less variation than the female scores?

Worksheet #1 for 7.7
1. A random sample of 10 observations from one population revealed a sample mean of 23 and a sample standard deviation of 4. A random sample of 8 observations from another population revealed a sample mean of 26 and a sample standard deviation of 5. At the 0.05 significance level, is there a difference between the population means?

2. A random sample of 15 observations from the first population revealed a sample mean of 350 and a sample standard deviation of 12. A random sample of 17 observations from the second population revealed a sample mean of 342 and a sample standard deviation of 15. At the 0.10 significance level, is there a difference in the population means?

3. A recent study compared the time spent together by single- and dual-earner couples. According to the records kept by the wives during the study, the mean amount of time spent together watching television among the single-earner couples was 61 minutes per day, with a standard deviation of 15.5 minutes. For the dual-earner couples, the mean number of minutes spent watching television was 48.4 minutes, with a standard deviation of 18.1 minutes. At the 0.01 significance level, can we conclude that the single-earner couples on average spend more time watching television together? There were 15 single-earner and 12 dual-earner couples studied.

Worksheet #2 for 8.0
1. What is the critical F value (step 4) for a sample of six observations in the numerator and four in the denominator?
Use a two-tailed test and the 0.10 significance level.

2. What is the critical F value (step 4) for a sample of four observations in the numerator and seven in the denominator? Use a one-tailed test and the 0.01 significance level.

3. A random sample of eight observations from the first population resulted in a standard deviation of 10. A random sample of six observations from the second population resulted in a standard deviation of 7. At the 0.02 significance level, is there a difference in the variation of the two populations?

4. A random sample of five observations from the first population resulted in a standard deviation of 12. A random sample of seven observations from the second population showed a standard deviation of 7. At the 0.01 significance level, is there more variation in the first population?

5. A media research company conducted a study of the radio listening habits of men and women. One aspect of the study involved the mean listening time. It was discovered that the mean listening time for men was 35 minutes per day. The standard deviation of the sample of the 10 men studied was 10 minutes per day. The mean listening time for the 12 women studied was also 35 minutes, but the standard deviation of the sample was 12 minutes. At the 0.10 significance level, can we conclude that there is a difference in the variation in the listening times for men and women?

(Definitely do NOT omit this F test!)
8.0 continued: ANOVA - Analysis of Variance

The F distribution is used for testing the equality of more than two means using a technique known as analysis of variance (ANOVA).

ANOVA requires the following conditions:
1. The populations being sampled are normally distributed.
2. The populations have equal standard deviations.
3. The samples are randomly selected and are independent.

The ANOVA test is used to determine if the various sample means came from a single population or from populations with different means.
The sample means are compared through their variances.

THE SAME FIVE-STEP HYPOTHESIS TESTING PROCEDURE IS USED. THE DIFFERENCE LIES IN THE TEST STATISTIC. THE TEST STATISTIC IS THE F DISTRIBUTION.

Analysis of Variance Procedure
For an analysis of variance problem the appropriate test statistic is F. The F statistic is the ratio of two variance estimates and is computed as:

$$F = \frac{\text{Estimate of the population variance based on the differences between the sample means}}{\text{Estimate of the population variance based on the variation within samples}}$$

The critical value for F is determined from the F tables found in Appendix G.

The ANOVA Table
A convenient way of organizing the calculations for F is to put them into a table, referred to as an ANOVA table.

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F
Treatments            SST              k - 1                SST/(k - 1) = MST   MST/MSE
Error                 SSE              n - k                SSE/(n - k) = MSE
Total                 SS Total

Keep in mind that the SS total term is the total variation, SST is the variation due to the treatments, and SSE is the variation within the samples. The three values are determined by first calculating SS total and SST, then finding SSE by subtraction.

8.2.4 Apply the F-distribution to help determine whether differences in sample means.

8.0 continued...ANOVA Table

Source of variation   Sum of squares   Degrees of freedom   Mean square         F
Treatments            SST              k - 1                SST/(k - 1) = MST   MST/MSE
Error                 SSE              n - k                SSE/(n - k) = MSE
Total                 SS Total         n - 1

Examples
1. Use the following sample information to see whether we can conclude that there is a difference in the means of the three groups. Use the 0.05 significance level.

Treatment 1: 4, 2, 6, 3, 5
Treatment 2: 3, 9, 5
Treatment 3: 4, 3, 5, 6

Source of variation   SS   df   Mean square   F
Treatments
Error

2. A real estate developer is considering investing in a shopping mall somewhere on the outskirts of town. Three different parcels of land are being evaluated.
Of particular importance is the income in the area surrounding the proposed mall. A random sample of four families is selected near each proposed mall. Following are the sample results. At the 0.05 significance level, can the developer conclude that there is a difference in the mean income at the three locations?

Area 1 (in thousands $)   Area 2 (in thousands $)   Area 3 (in thousands $)
64                        74                        75
68                        71                        80
70                        69                        76
60                        70                        78

Source of variation   SS   df   Mean square   F
Treatment
Error

Worksheet for ANOVA
1. The following is sample information for three different treatments. Can we conclude at the 0.05 significance level that the treatment means are not equal?

Treatment 1: 9, 7, 11, 9, 12, 10
Treatment 2: 13, 20, 14, 13
Treatment 3: 10, 9, 15, 14, 15

2. A manager of a computer software company wishes to study the number of hours senior executives spend at their computers by type of industry. The manager selected a sample of five executives from each of the three industries. At the 0.05 significance level, can she conclude there is a difference in the mean number of hours spent per week by industry?

Banking:   12, 10, 10, 12, 10
Retail:    8, 8, 6, 8, 10
Insurance: 10, 8, 6, 8, 10

8.3 Two-way analysis of variance (ANOVA)

8.3.1 Analyze two-way analysis of variance (ANOVA) tables where two factors may affect the sample means

You are responsible for interpreting and analyzing results tables, but you will not have to come up with the data yourself!

When we have two factors with at least two levels and one or more observations at each level, we say we have a two-way layout. We say that the two-way layout is crossed when every level of Factor A occurs with every level of Factor B. With this kind of layout we can estimate the effect of each factor (Main Effects) as well as any interaction between the factors. As in the one-way case, we test the hypotheses that the two main effects and the interaction are zero.
The two-way crossed ANOVA is useful when we want to compare the effect of multiple levels of two factors and we can combine every level of one factor with every level of the other factor. If we have multiple observations at each level, then we can also estimate the effects of interaction between the two factors.

Example: Let's assume that we want to test if there are any differences in pin diameters for a part due to different types of coolant. We still have five different machines making the same part and we take five samples from each machine for each coolant type to obtain the following data.

            Coolant 1                           Coolant 2
Machine 1   0.125, 0.127, 0.125, 0.126, 0.128   0.124, 0.128, 0.127, 0.126, 0.129
Machine 2   0.118, 0.122, 0.120, 0.124, 0.119   0.116, 0.125, 0.119, 0.125, 0.120
Machine 3   0.123, 0.125, 0.125, 0.124, 0.126   0.122, 0.121, 0.124, 0.126, 0.125
Machine 4   0.126, 0.128, 0.126, 0.127, 0.129   0.126, 0.129, 0.125, 0.130, 0.124
Machine 5   0.118, 0.129, 0.127, 0.120, 0.121   0.125, 0.123, 0.114, 0.124, 0.117

We can summarize the analysis results in an ANOVA table as follows:

Source            Sum of Squares   Deg. of Freedom   Mean Square   F0
machine           0.000303         4                 0.000076      8.8
coolant           0.00000392       1                 0.00000392    0.45
interaction       0.00001468       4                 0.00000367    0.42
residuals         0.000346         40                0.0000087
corrected total   0.000668         49

By dividing the mean square for machine by the mean square for residuals we obtain an F0 value of 8.8, which is greater than the critical value of 2.61 based on 4 and 40 degrees of freedom and a 0.05 significance level. Likewise, the F0 values for Coolant and Interaction, obtained by dividing their mean squares by the residual mean square, are less than their respective critical values of 4.08 and 2.61 (0.05 significance level).
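The division described above can be sketched in a few lines. The dictionary names below are made up for this illustration; the mean squares and critical values are copied from the table and paragraph above. Because the tabulated mean squares are rounded, the machine ratio computed here comes out slightly below the reported 8.8.

```python
# Mean squares as printed in the ANOVA table above (already rounded)
mean_squares = {
    "machine": 0.000076,
    "coolant": 0.00000392,
    "interaction": 0.00000367,
}
ms_residual = 0.0000087

# Critical F values quoted in the text (0.05 significance level)
critical_values = {"machine": 2.61, "coolant": 4.08, "interaction": 2.61}

# F0 for each source is its mean square divided by the residual mean square
f_ratios = {source: ms / ms_residual for source, ms in mean_squares.items()}

# A source is significant when its F0 exceeds the matching critical value
significant = {source: f_ratios[source] > critical_values[source] for source in f_ratios}

for source in f_ratios:
    print(f"{source}: F0 = {f_ratios[source]:.2f}, significant: {significant[source]}")
```

Only the machine effect exceeds its critical value, which matches the conclusion drawn from the table.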
So, you'll have to remember the degrees of freedom for each row:
For the first row, it's I-1 for the numerator and IJ(K-1) for the denominator.
For the second row, it's J-1 for the numerator and IJ(K-1) for the denominator.
For the third row, it's (I-1)(J-1) for the numerator and IJ(K-1) for the denominator.
Where I is the number of rows (here, the machines), J is the number of columns (here, the types of coolant), and K is the number of values in each row and column combination!

From the ANOVA table we can conclude that machine is the most important factor and is statistically significant. Coolant is not significant and neither is the interaction. These results would lead us to believe that some tool-matching efforts would be useful for improving this process.

Extra Example!
Suppose you want to determine whether the brand of laundry detergent used and the temperature affect the amount of dirt removed from your laundry. To this end, you buy two different brands of detergent ("Super" and "Best") and choose three different temperature levels ("cold", "warm", and "hot"). Then you divide your laundry randomly into 6×r piles of equal size and assign r piles to each combination of ("Super" and "Best") and ("cold", "warm", and "hot"). In this example, we are interested in testing the Null Hypotheses:
H0D : The amount of dirt removed does not depend on the type of detergent
H0T : The amount of dirt removed does not depend on the temperature

        Super           Best
Cold    4, 5, 6, 5      6, 6, 4, 4
Warm    7, 9, 8, 12     13, 15, 12, 12
Hot     10, 12, 11, 9   12, 13, 10, 13

Here are the results:

Analysis of Variance Table
Response: wash
              Df   Sum Sq    Mean Sq   F value   Pr(>F)
deter         1    20.167    20.167    9.8108    0.005758 **
water         2    200.333   100.167   48.7297   5.44e-08 ***
deter:water   2    16.333    8.167     3.9730    0.037224 *
Residuals     18   37.000    2.056

Determine the proper critical values using the 0.05 significance level to draw your conclusion!

Test 3 up to here!
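For the laundry detergent example, the sums of squares in the results table can be reproduced by hand. The sketch below is one way to do that for a balanced design (the same number of observations in every cell), which is assumed here; the variable and function names are made up for this illustration.

```python
# Dirt-removal scores for each (detergent, temperature) cell, 4 piles per cell
data = {
    ("Super", "Cold"): [4, 5, 6, 5],
    ("Super", "Warm"): [7, 9, 8, 12],
    ("Super", "Hot"): [10, 12, 11, 9],
    ("Best", "Cold"): [6, 6, 4, 4],
    ("Best", "Warm"): [13, 15, 12, 12],
    ("Best", "Hot"): [12, 13, 10, 13],
}
detergents = ["Super", "Best"]
temperatures = ["Cold", "Warm", "Hot"]

def mean(values):
    return sum(values) / len(values)

all_values = [v for cell in data.values() for v in cell]
n = len(all_values)
grand_mean = mean(all_values)

def ss_factor(index, levels):
    """Sum of squares for one factor: count * (level mean - grand mean)^2 over levels."""
    total = 0.0
    for level in levels:
        obs = [v for key, cell in data.items() if key[index] == level for v in cell]
        total += len(obs) * (mean(obs) - grand_mean) ** 2
    return total

ss_deter = ss_factor(0, detergents)
ss_water = ss_factor(1, temperatures)
# Variation between cell means, then remove the two main effects to get the interaction
ss_cells = sum(len(c) * (mean(c) - grand_mean) ** 2 for c in data.values())
ss_inter = ss_cells - ss_deter - ss_water
# Residual variation: each observation compared with its own cell mean
ss_resid = sum((v - mean(c)) ** 2 for c in data.values() for v in c)

df_deter = len(detergents) - 1
df_water = len(temperatures) - 1
df_inter = df_deter * df_water
df_resid = n - len(data)

ms_resid = ss_resid / df_resid
f_deter = (ss_deter / df_deter) / ms_resid
f_water = (ss_water / df_water) / ms_resid
f_inter = (ss_inter / df_inter) / ms_resid

print(f"deter:       SS = {ss_deter:.3f}, F = {f_deter:.4f}")
print(f"water:       SS = {ss_water:.3f}, F = {f_water:.4f}")
print(f"deter:water: SS = {ss_inter:.3f}, F = {f_inter:.4f}")
print(f"Residuals:   SS = {ss_resid:.3f}, df = {df_resid}")
```

The critical values still have to be looked up in the F table using the degrees of freedom computed here (1, 2, and 2 for the numerators, 18 for the denominator).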
9.0 Chi-Square distribution

9.1 Introduction

9.1.1 Describe the properties and uses of the Chi-Square Distribution.

The Chi-Square Distribution
So far we have used the standard normal z, t, and F distributions as the test statistics. In Chapter 13 we will learn how and when to use the Chi-Square as the test statistic.

The chi-square is similar to the t and F distributions in that there is a family of distributions - each has a different shape depending on the number of degrees of freedom. As the illustration shows, when the number of degrees of freedom is small, the distribution is positively skewed, but as the number of degrees of freedom increases it becomes symmetrical and approaches the normal distribution.

Chi-square is based on squared deviations between an observed frequency and an expected frequency - therefore, it is never negative.

9.2 Goodness of Fit Test

9.2.1 Describe the purpose of a goodness of fit test

Goodness-of-Fit Tests
In the goodness-of-fit test the chi-square distribution is used to determine how well an observed set of observations fits an expected set of observations.

Goodness-of-fit test: A nonparametric test involving a set of observed frequencies and a corresponding set of expected frequencies.

The purpose of the goodness-of-fit test is to determine if there is a statistical difference between the two sets of data - one which is observed and the other expected. Is the difference due to chance, or can we conclude there is a significant difference between the two sets of values?

NOTE: Again, the same systematic five-step hypothesis testing procedure is followed in our solution.

We begin by denoting f0 as the observed frequency in a particular category and fe as the expected frequency in a particular category.

NOTE: A category is referred to as a cell.

Step 1: State the null and alternate hypotheses:
Step 2: Select the Level of Significance - This is the probability of committing a Type I error.
Step 3: Select the test statistic. The test statistic is the chi-square statistic:
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$
Step 4: Formulate the decision rule. Find the critical value of chi-square. This critical value is found in Appendix H by locating the number of degrees of freedom in the left column and moving horizontally to the right to read the value associated with the level of significance.
Step 5: Compute the value of chi-square and make your decision. Page 443 of your text illustrates the procedure for computing the value.

It is not necessary that the expected frequencies be equal to apply the goodness-of-fit test. The text illustrates the case of unequal frequencies and also gives a practical use of chi-square.

Examples:
1. A student sells baseball cards for a day. At the end of the day she records the sales of the six types of cards in a chart as shown below.

Player         Cards Sold
Tom Seaver     13
Nolan Ryan     33
Ty Cobb        14
George Brett   7
Hank Aaron     36
Johnny Bench   17

At the 0.05 significance level, can she conclude the sales are not the same for each player?

2. A human resources manager records the number of sick days over a week. The following data was gathered.

Day of the week   Number absent
Monday            12
Tuesday           9
Wednesday         11
Thursday          10
Friday            9
Saturday          9

At the 0.01 significance level, can she conclude that there is no difference in the absenteeism throughout the six-day workweek?

9.3 Test of Independence

9.3.1 Perform the Chi-Square Test to determine whether two classifications of the same data are independent of each other

Contingency Tables
The chi-square distribution is also used to determine if there is a relationship between two or more criteria of classification. For example, we may be interested in whether or not there is a relationship between job advancement within a company and the gender of the employee.

Contingency Table: A table made up of rows and columns. Each box is referred to as a cell.

The usual five-step hypothesis testing procedure is followed.
The expected frequency, fe, is computed by the formula:
$$f_e = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$$

The number of degrees of freedom used to find the critical value for chi-square is:
df = (number of rows - 1)(number of columns - 1)

There is a limitation to the use of the chi-square distribution: the value of fe should be at least 5 for each cell (box). This requirement is to prevent any cell from carrying an inordinate amount of weight and causing the null hypothesis to be rejected.

Examples:
1. A Correction Agency is investigating whether those released from prison show a different adjustment if they return to their hometown or if they go elsewhere to live. In other words, they would like to know whether there is a relationship between adjustment to civilian life and place of residence. The data below was gathered. Using the 0.01 significance level, determine if a relationship exists.

                   Outstanding   Fair   Unsatisfactory
Live in hometown   27            35     33
Live elsewhere     13            15     27

2. A social scientist sampled 140 people and classified them according to income level and whether or not they played a lottery in the last month. The info is given below. Can we conclude that playing the lottery is related to income level? Use the 0.05 significance level.

Played the lottery in the last month / Did not play the lottery in the last month / High income / Low income: 21 46 19 14

Worksheet for 9.0
1. In a particular chi-square goodness-of-fit test there are four categories and 200 observations. Use the .05 significance level.
a. How many degrees of freedom are there?
b. What is the critical value of chi-square?

2. In a particular chi-square goodness-of-fit test there are six categories and 500 observations. Use the .01 significance level.
a. How many degrees of freedom are there?
b. What is the critical value of chi-square?

3. The null hypothesis and the alternate are:
H0: The cell categories are equal.
H1: The cell categories are not equal.

Category   f0
A          10
B          20
C          30

a. State the decision rule, using the .05 significance level.
b. Compute the value of chi-square.
c. What is your decision regarding H0?

4. The null hypothesis and the alternate are:
H0: The cell categories are equal.
H1: The cell categories are not equal.

Category   f0
A          10
B          20
C          30
D          20

5. Classic Golf, Inc. manages five courses in the Jacksonville, Florida, area. The Director wishes to study the number of rounds of golf played per weekday at the five courses. He gathered the following sample information.

Day         Rounds
Monday      124
Tuesday     74
Wednesday   104
Thursday    98
Friday      120

6. The director of advertising for the Carolina Sun Times, the largest newspaper in the Carolinas, is studying the relationship between the type of community in which a subscriber resides and the section of the newspaper he or she reads first. For a sample of readers, she collected the following sample information.

         National News   Sports   Comics
City     170             124      90
Suburb   120             112      100
Rural    130             90       88

At the .05 significance level, can we conclude there is a relationship between the type of community where the person resides and the section of the paper read first?

7. The Quality Control Department at Food Town, Inc., a grocery chain in upstate New York, conducts a monthly check on the comparison of scanned prices to posted prices. The chart below summarizes the results of a sample of 500 items last month. Company management would like to know whether there is any relationship between error rates on regular priced items and specially priced items. Use the .01 significance level.

                Regular Price   Advertised Special Price
Undercharge     20              10
Overcharge      15              30
Correct Price   200             225

9.2.1 (in new outline) Perform the chi-square test to determine whether more than two population proportions can be considered equal

Remember: The chi-square independence test is used to find out whether there is an association between a row variable and column variable in a contingency table constructed from sample data.
The null hypothesis is that the variables are not associated; in other words, they are independent. The alternative hypothesis is that the variables are associated, or dependent.

Chi-Square Test for Homogeneity of Proportions: In a chi-square test for homogeneity of proportions, we test the claim that different populations have the same proportion of individuals with some characteristic.

Note: The chi-square test for independence is a test regarding a sample from a single population. We now discuss a second type of chi-square test, which can be used to compare the population proportions from different populations. (same method, so this is just an extension of the last section!)

Example: A drug manufacturer makes a drug (Zocor) that is meant to reduce the level of LDL cholesterol, while increasing the level of HDL cholesterol. In clinical trials of the drug, patients were randomly divided into three groups: group 1 received Zocor, group 2 received a placebo, and group 3 received cholestyramine, a currently available drug. The table below contains the number of patients in each group who did and did not experience abdominal pain as a side effect.

                                                         Group 1 (Zocor)   Group 2 (Placebo)   Group 3 (Cholestyramine)
Number of people who experienced abdominal pain          51                5                   16
Number of people who did not experience abdominal pain   1532              152                 163

Is there evidence to indicate that the proportion of subjects in each group who experienced abdominal pain is different at the α = 0.01 level of significance?

10.0 Linear Regression and Correlation

10.1 Introduction

10.1.1 Describe the applications of linear regression.

The main purpose of linear regression is firstly to see if there is a linear relationship between two variables, and secondly to predict the value of one variable given some particular value of the other variable. To start our discussion we must introduce the concept of scatter diagrams.
10.1.2 Distinguish between independent and dependent variables

10.2 Scatter Diagrams

10.2.1 Construct scatter diagrams to determine if two variables are related.

The purpose of correlation analysis is to find out how strong the relationship is between two variables. One way of looking at the relationship between two variables is to portray the information in a scatter diagram. The values of the independent variable are portrayed on the horizontal axis (X-axis) and the dependent variable along the vertical axis (Y-axis). The scatter diagram provides a visual display of the "scatter" of the data and whether or not there appears to be a linear relationship.

Example: Construct a scatter diagram using the data below.

x   4   5   6   7
y   6   7   9   10

10.2.2 What is correlational analysis?

The purpose of correlation analysis is to find out how strong the relationship is between two variables. We are often interested in whether two variables exhibit a linear correlation. One way to tell if there is a linear correlation is to construct a scatter diagram; if it takes the shape of a line, then there exists some degree of linear correlation.

10.3 The Coefficient of Correlation

10.3.1 Calculate the coefficient of correlation and use it to determine the strength of linear relationships between variables.

The Coefficient of Correlation
A measure of the strength of the linear (straight-line) association between two variables is given by the coefficient of correlation. It is also called Pearson's product moment correlation coefficient or Pearson's r, after Karl Pearson.

This information is summarized in the following charts:
Perfect Negative Correlation:
Perfect Positive Correlation:

The formula to compute the coefficient of correlation is:
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, s_x s_y}$$

The coefficient of correlation requires that both variables be at least of interval scale.
The degree of strength of the relationship is not related to the sign (direction + or -) of the coefficient of correlation.
o For example, an r value of -0.60 represents the same degree of correlation as that of +0.60.

NOTE: The following is the formula for s (SAMPLE standard deviation). It is slightly different from the population standard deviation formula you used before.
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
Where
s: sample standard deviation
X: the data value
X̄: the sample mean
n: the sample size

Example 1: Calculate the coefficient of correlation for the following variables. (Formulas that will be needed: s and r, as given above.)

x   y
4   10
5   2
6   3

Example 2: Calculate the coefficient of correlation for the following variables. (Formulas that will be needed: s and r, as given above.)

x    y
2    1
4    5
9    10
11   6

10.4 The Reliability of Correlation

10.4.1 Determine the reliability of the coefficient of correlation through hypothesis testing.

Test of Significance of the Correlation Coefficient
A test of significance for the coefficient of correlation may be used to determine if the computed r could have occurred in a population in which the two variables are not related. Is the correlation in the population zero?

For a two-tailed test the null hypothesis and the alternate hypothesis are written as follows:
H0 : ρ = 0 (The correlation in the population is zero)
H1 : ρ ≠ 0 (The correlation in the population is different from zero)

The Greek lower case rho, ρ, represents the correlation in the population. The null hypothesis is that there is no correlation in the population, and the alternate is that there is a correlation.

The test statistic follows the t distribution with n - 2 degrees of freedom.

Examples:
1. A sample of 25 mayoral campaigns in cities with populations greater than 50,000 showed that the correlation between the percent of the vote received and the amount spent on the campaign by the candidate was 0.43.
At the 0.05 significance level, is there a correlation between the variables?

2. An airline selects a random sample of 25 flights and finds that the correlation between the number of passengers and the total weight, in pounds, of luggage stored in the luggage compartment is 0.94. Using the 0.05 significance level, can we conclude that there is a positive correlation between the two variables?

3. A company does a survey of 26 people and finds that there is a correlation between video game playing and television watching. The correlation is found to be -0.67. Using the 0.025 significance level, can we conclude that there is a negative correlation between the two variables?

Worksheet for 10.0 and 10.5 (to come)
1. The following sample observations were randomly selected.

X    Y
4    4
5    6
3    5
6    7
10   7

a. Determine the coefficient of correlation.
b. Determine the regression equation. (will do in 10.5)
c. Determine the value of Y' when X is 5. (10.5)
d. Determine the standard error of estimate. (10.5)
e. Determine the 0.90 confidence interval for the mean predicted when X=5. (10.5)

2. The following sample observations were randomly selected.

X    Y
4    10
5    5
7    1
10   2

a. Determine the coefficient of correlation.
b. Determine the regression equation. (will do in 10.5)
c. Determine the value of Y' when X is 1. (10.5)
d. Determine the standard error of estimate. (10.5)
e. Determine the 0.95 confidence interval for the mean predicted when X=1. (10.5)

3. A random sample of 12 paired observations indicated a correlation of 0.32. Can we conclude that the correlation in the population is greater than zero? Use the 0.05 significance level.

4. A random sample of 15 paired observations has a correlation of -0.46. Can we conclude that the correlation in the population is less than zero? Use the 0.05 significance level.

5. A refining company is studying the relationship between the pump price of gasoline and the number of gallons sold.
For a sample of 20 stations last Tuesday, the correlation was 0.78. At the 0.01 significance level, is the correlation in the population greater than zero?

6. A study of 20 worldwide financial institutions showed the correlation between their assets and pre-tax profit to be 0.86. At the 0.05 significance level, can we conclude that there is positive correlation in the population?

10.5 Linear Regression

10.5.1/10.5.2 Describe the method of least squares for developing a simple linear regression model / Calculate a linear regression line and use it to predict the value of one variable when given the other.

Regression Analysis
The equation for a straight line is used to estimate Y based on X and is referred to as the regression equation. The equation for a straight line is: Y' = a + bX. The technique used to develop the equation for the line and make these predictions is called regression analysis.

Purpose:
Procedure:

The linear relationship between two variables is given by the regression equation:
$$Y' = a + bX$$
The coefficients can also be written as:
$$b = r\,\frac{s_y}{s_x} \qquad a = \bar{Y} - b\bar{X}$$

(omit) Least Squares Principle
In the regression equation Y' = a + bX, the value of a is the Y intercept and b is the regression coefficient. These two values are developed mathematically using the least squares principle.

least squares principle: Determining a regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values Y'.

There is only one line for which SSE is a minimum. This line is the least squares line, the regression line, or the least squares prediction equation. The methodology used to obtain this line is called the method of least squares.

Examples:
1. Suppose an appliance store conducts a 5-month experiment to determine the effect of advertising on sales revenue. The results are shown in the table below.
Month     Advertising expenditure, x (in $100s)   Sales revenue, y (in $1000s)
Month 1   1                                       1
Month 2   2                                       1
Month 3   3                                       2
Month 4   4                                       2
Month 5   5                                       4

a. Determine the regression equation for the data above.
b. Predict the sales revenue when advertising expenditure is $200 (i.e., when x=2).
c. Give practical interpretations for the y-intercept (a) and the slope (b) of the line.
(omit) d. Show that the sum of the errors (SE) equals 0.
(omit) e. Calculate the sum of the squared errors (SSE) and state its significance for the regression model.

2. An investigation of the properties of bricks used to line aluminum smelter pots was published in a journal. Six different commercial bricks were evaluated. The life length of a smelter pot depends on the porosity of the brick lining (the less porosity, the longer the life); consequently, the researchers measured the apparent porosity of each brick specimen, as well as the mean pore diameter of each brick. The data are given in the accompanying table.

Brick   Mean pore diameter (micrometers)   Apparent porosity (%)
A       12.0                               18.8
B       9.7                                18.3
C       7.3                                16.3
D       5.3                                6.9
E       10.9                               17.1
F       16.8                               20.4

a. Find the least squares line relating porosity (y) to mean pore diameter (x).
b. Predict the apparent porosity percentage for a brick with a mean pore diameter of 10 micrometers.
c. Give practical interpretations for the y-intercept (a) and the slope (b) of the line.
(omit) d. Show that the sum of the errors (SE) equals 0.
(omit) e. Calculate the sum of the squared errors (SSE) and state its significance for the regression model.

3. Researchers at a company investigated the effect of tablet surface area and volume on the rate at which a drug is released in a controlled-release dosage. Six similarly shaped tablets were prepared with different weights and thicknesses, and the ratio of surface area to volume was measured for each.
Using a dissolution apparatus, each tablet was placed in 900 milliliters of deionized water, and the diffusional drug release rate (percentage of drug released divided by the square root of time) was determined. The experimental data are listed in the table.

Surface area to volume (mm2/mm3)   Drug release rate (% released/√time)
1.50                               60
1.05                               48
0.90                               39
0.75                               33
0.60                               30
0.65                               29

a. Find the least squares line relating drug release rate (y) to surface area to volume ratio (x).
b. Predict the drug release rate for a tablet that has a surface area/volume ratio of 0.50.
c. Give practical interpretations for the y-intercept (a) and the slope (b) of the line.
(omit) d. Show that the sum of the errors (SE) equals 0.
(omit) e. Calculate the sum of the squared errors (SSE) and state its significance for the regression model.

10.6 Standard error of estimate

10.6.1 Compute the standard error of estimate and use it to estimate the predictability of the regression line.

The Standard Error of Estimate

The predicted value Y' will rarely be exactly the same as the actual Y value. We expect some prediction error. One measure of this error is called the standard error of estimate. It is written \(s_{y \cdot x}\).

A small standard error of estimate indicates that the independent variable is a good predictor of the dependent variable. It is similar to the standard deviation.

Linear regression is based on these four assumptions:

Examples:

1. Determine the standard error of estimate for the following data. (Note, the data is from #1)

Month      Advertising expenditure, x (in $100s)   Sales revenue, y (in $1000s)
Month 1    1                                       1
Month 2    2                                       1
Month 3    3                                       2
Month 4    4                                       2
Month 5    5                                       4

2. Determine the standard error of estimate for the following data. (Note, the data is from #2)

Brick   Mean pore diameter (micrometers)   Apparent porosity (%)
A       12.0                               18.8
B       9.7                                18.3
C       7.3                                16.3
D       5.3                                6.9
E       10.9                               17.1
F       16.8                               20.4

3. Determine the standard error of estimate for the following data.
(Note, the data is from #3 in 10.5.1/10.5.2)

Surface area to volume (mm2/mm3)   Drug release rate (% released/√time)
1.50                               60
1.05                               48
0.90                               39
0.75                               33
0.60                               30
0.65                               29

10.6.2 Determine confidence intervals for regression estimates.

Establishing a Confidence Interval for Y

The standard error of estimate is also used to set confidence intervals for the predicted value Y'. When the sample size is large and the scatter about the regression line is approximately normally distributed, the following relationships can be expected:

\(Y' \pm 1s_{y \cdot x}\) encompasses about 68% of the observed values.
\(Y' \pm 2s_{y \cdot x}\) encompasses about 95.5% of the observed values.
\(Y' \pm 3s_{y \cdot x}\) encompasses about 99.7% of the observed values.

Two types of confidence intervals may be developed: one for the mean value of Y for a given value of X, and one for an individual value of Y for a given value of X, called a prediction interval.

To explain the difference: Suppose we are predicting the salary of management personnel who are 40 years old. In this case we are predicting the mean salary of all management personnel age 40. However, if we want to predict the salary of a particular manager who is 40, then we are making a prediction about a particular individual.

The confidence interval for the mean value of Y for a given value of X is given by:

\[ Y' \pm t\, s_{y \cdot x} \sqrt{\frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}} \]

Where:
\(Y'\) is the predicted value for any selected X value
\(X\) is any selected value of X
\(\bar{X}\) is the mean of the X's
\(n\) is the number of observations
\(s_{y \cdot x}\) is the standard error of estimate
\(t\) is the value of t from Appendix F with n - 2 degrees of freedom

Examples:

1. a. Determine the 0.95 confidence interval for the mean predicted when X = 2. (Note, the data is from #1 in 10.5.1/10.5.2)
b. What does it represent?

Month      Advertising expenditure, x (in $100s)   Sales revenue, y (in $1000s)
Month 1    1                                       1
Month 2    2                                       1
Month 3    3                                       2
Month 4    4                                       2
Month 5    5                                       4

2.
Determine the 0.90 confidence interval for the mean predicted when X = 10 for the following data. (Note, the data is from #2 in 10.5.1/10.5.2)

Brick   Mean pore diameter (micrometers)   Apparent porosity (%)
A       12.0                               18.8
B       9.7                                18.3
C       7.3                                16.3
D       5.3                                6.9
E       10.9                               17.1
F       16.8                               20.4

3. Determine the 0.99 confidence interval for the mean predicted when X = 0.50 for the following data. (Note, the data is from #3 in 10.5.1/10.5.2)

Surface area to volume (mm2/mm3)   Drug release rate (% released/√time)
1.50                               60
1.05                               48
0.90                               39
0.75                               33
0.60                               30
0.65                               29

Extra Example for 10.0 and 10.5

x   2   4   9    11
y   1   5   10   14

a. Find the coefficient of correlation, r.
b. Find the regression equation that fits the data.
c. Predict the value of y when x = 10.
d. Calculate the standard error of estimate.
e. Calculate the 90% confidence interval for x = 10.

11.0 Multiple Regression

11.1 Compute the coefficients of a multiple regression line and use the equation to predict the value of a dependent variable when given the value of independent variables/11.1.1 Interpret the coefficients of the multiple regression equation

When there are several independent variables, you can extend the simple linear regression model from Unit 10. The following equation defines the multiple regression model with two independent variables:

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i \]

Where:
\(\beta_0\) = intercept
\(\beta_1\) = slope of \(Y\) with variable \(X_1\), holding variable \(X_2\) constant
\(\beta_2\) = slope of \(Y\) with variable \(X_2\), holding variable \(X_1\) constant

Examples:

1. A graphing calculator was used to determine the values of the regression coefficients for a multiple regression equation. The two variables used were:

\(X_{1i}\) = price of an OmniPower bar (in cents) for store \(i\)
\(X_{2i}\) = monthly in-store promotional expenditures (in dollars) for store \(i\)

And these independent variables were used to calculate:

\(\hat{Y}_i\) = predicted monthly sales of OmniPower bars (in # of bars) for store \(i\)

The computed values for the regression coefficients are:

a. What is the multiple regression equation?
b. Interpret the Y intercept.
c. Interpret \(b_1\).
d. Interpret \(b_2\).
e.
Use your multiple regression equation to predict the sales for a store charging 79 cents during a month in which promotional expenditures are $400.

Example 2: A marketing analyst for a shoe manufacturer is considering the development of a new brand of running shoes. The marketing analyst wants to determine which variables to use in predicting durability. Two independent variables under consideration are \(X_1\), a measurement of the forefoot shock-absorbing capability, and \(X_2\), a measurement of the change in impact properties over time. The dependent variable, \(Y\), is a measure of the shoe's durability after a repeated impact test. A random sample of 15 types of currently manufactured running shoes was selected for testing, and the following regression coefficients were found:

a. What is the multiple regression equation?
b. Interpret the Y intercept.
c. Interpret \(b_1\).
d. Interpret \(b_2\).
e. Use your multiple regression equation to predict the shoe's durability if the forefoot shock-absorbing capability is 1.5 units and the measurement of the change in impact properties over time is 2.1 units.

11.1.2 Conduct a test of hypothesis to determine if the regression coefficients differ from zero

Testing for the Slope in Multiple Regression

\[ t = \frac{b_j - \beta_j}{s_{b_j}} \]

Where:
\(b_j\) = slope of variable \(j\) with \(Y\), holding constant the effects of all other independent variables
\(s_{b_j}\) = standard error of the regression coefficient \(b_j\)
\(t\) = test statistic for a t distribution with \(n - k - 1\) degrees of freedom
\(k\) = number of independent variables in the regression equation
\(\beta_j\) = hypothesized value of the population slope for variable \(j\), holding constant the effects of all other independent variables

Examples:

1. a. Determine whether variable \(X_2\) (amount of promotional expenditures) has a significant effect on sales in the OmniPower bar example from example 1 in Section 11.1 if the standard error of \(b_2\) was determined to be 0.6852. There were 34 stores sampled for data. Use the 0.05 level of significance. (i.e.
Determine whether \(\beta_2\) is different from zero at the 0.05 level of significance.)

Recall that:
\(X_{1i}\) = price of an OmniPower bar (in cents) for store \(i\)
\(X_{2i}\) = monthly in-store promotional expenditures (in dollars) for store \(i\)
\(\hat{Y}_i\) = predicted monthly sales of OmniPower bars for store \(i\)

The computed values for the regression coefficients are:

b. Determine whether there is evidence that the slope of sales with price (i.e. \(\beta_1\)) is different from zero at the 0.05 level of significance if the standard error for \(b_1\) is 6.8522.

2. Recall example #2 in section 11.1:

Example 2: A marketing analyst for a shoe manufacturer is considering the development of a new brand of running shoes. The marketing analyst wants to determine which variables to use in predicting durability. Two independent variables under consideration are \(X_1\), a measurement of the forefoot shock-absorbing capability, and \(X_2\), a measurement of the change in impact properties over time. The dependent variable, \(Y\), is a measure of the shoe's durability after a repeated impact test. A random sample of 15 types of currently manufactured running shoes was selected for testing, and the following regression coefficients were found:

At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model (i.e. conduct two hypothesis tests) if the standard error for \(b_1\) is 0.06295 and the standard error for \(b_2\) is 0.07174.

To add to your Comprehensive assignment

38. A random sample of 90 observations produced a mean and a standard deviation
a. Find an approximate 95% confidence interval for the population mean. (4 marks)
b. Find an approximate 90% confidence interval for the population mean. (4 marks)
c. Find an approximate 99% confidence interval for the population mean. (4 marks)

39. A survey was conducted by a broadcasting company in which they asked 501 satellite radio subscribers if they had a satellite radio receiver in their cars.
They found that 396 subscribers did have a satellite receiver in their car. Find and interpret a 90% confidence interval for the proportion. (4 marks)

40. A Golf Association would like to measure the average distance traveled when a golf ball is hit by a machine. Suppose the association wishes to estimate the mean distance for a new brand to within 1 yard with 90% confidence. Assume that past tests have indicated that the standard deviation of the distance the machine hits golf balls is approximately 10 yards. How many golf balls should be hit by the machine to achieve the desired accuracy in estimating the mean? (4 marks)

41. A company warns that bottled water may contain more bacteria than allowed by law. Of the more than 1000 bottles studied, nearly one-third exceeded government levels. Suppose that the company wants an updated estimate of the population proportion of bottled water that violates government standards. Determine the sample size (number of bottles) needed to estimate this proportion to within with 99% confidence. (4 marks)

42. A researcher wishes to determine whether prenatal alcohol affects learning in rats, whether learning in rats changes with age, and whether the effects of prenatal alcohol depend on the age at testing. The researcher uses a two by two design with five subjects per group. The following results table is produced:

Source              SS      df   MS      F       p
A (alcohol)         192.2   1    192.2   40.68   .05
B (age)             57.8    1    57.8    12.23   .05
AxB (interaction)   168.2   1    168.2   35.60   .05
Within              75.6    16   4.725
Total               493.8   19

Determine the critical values for the first three rows of your table and compare with the calculated values in the table to help you draw your conclusion. (8 marks)

43. A mail-order catalog business selling personal computer supplies, software, and hardware maintains a centralized warehouse. Management is currently examining the process of distribution from the warehouse and wants to study the factors that affect warehouse distribution costs.
Currently, a small handling fee is added to each order, regardless of the amount of the order. Data collected over the past 24 months indicate the warehouse distribution costs, \(Y\) (in thousands of dollars), the sales, \(X_1\) (in thousands of dollars), and the number of orders received, \(X_2\).

a. State the multiple regression equation if the regression coefficients were determined to be: (2 marks)
b. Interpret the meaning of the slopes, \(b_1\) and \(b_2\), in this problem. (4 marks)
c. Does an interpretation of the regression coefficient, \(b_0\), make any sense in this example? Why or why not? (2 marks)
d. Predict the mean monthly warehouse distribution cost when sales are $400,000 and the number of orders is 4500. (2 marks)
e. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model if the standard error for \(b_1\) is 0.0203 and the standard error for \(b_2\) is 0.00225. (i.e. perform two hypothesis tests) (16 marks)

Optional: 11.1.3 Conduct a test of hypothesis on each of the regression coefficients

Final Exam up to here!
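
Checking Unit 10 calculations by computer

The Unit 10 steps (least squares line, standard error of estimate, confidence interval for the mean of Y at a given X) can be checked with a short program. The following is only a minimal Python sketch, not part of the course method. The function names are made up for illustration. It assumes the standard error of estimate is computed as the square root of SSE/(n - 2), and it assumes the t value is looked up by hand in a t table (here 3.182 for 3 degrees of freedom at 0.95 confidence). The data are the advertising figures from example 1 in 10.5.1/10.5.2.

```python
import math

def least_squares(xs, ys):
    """Return (a, b) for the line Y' = a + bX that minimizes SSE."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # Y intercept
    return a, b

def standard_error_of_estimate(xs, ys, a, b):
    """Assumed formula: s_y.x = sqrt(SSE / (n - 2))."""
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))

def mean_confidence_interval(xs, ys, x0, t):
    """Interval for the mean of Y at X = x0; t comes from a t table (n - 2 df)."""
    n = len(xs)
    a, b = least_squares(xs, ys)
    s = standard_error_of_estimate(xs, ys, a, b)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    margin = t * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    y_pred = a + b * x0
    return y_pred - margin, y_pred + margin

# Advertising expenditure (in $100s) and sales revenue (in $1000s), example 1
ads = [1, 2, 3, 4, 5]
sales = [1, 1, 2, 2, 4]

a, b = least_squares(ads, sales)
s = standard_error_of_estimate(ads, sales, a, b)
low, high = mean_confidence_interval(ads, sales, x0=2, t=3.182)

print(f"Y' = {a:.2f} + {b:.2f}X")          # roughly Y' = -0.10 + 0.70X
print(f"standard error of estimate: {s:.4f}")
print(f"0.95 interval for mean Y at X=2: ({low:.3f}, {high:.3f})")
```

The same three functions can be reused for the brick and tablet data by swapping in those lists and the matching t value for n - 2 degrees of freedom.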