Review Statistics I Topics

Building Blocks of my Statistics 1 course
1. Definitions
2. Data
   What types of data are available?
   How can data be collected?
3. Graphs
   How can data be graphed?
   How does the proportion of data in a range relate to probability?
4. How do you calculate population and sample averages?
5. For the population and sample, how do you calculate the typical distance a value is from its average?
6. How do you determine the probabilities associated with the bell-shaped curve?
7. What are the characteristics of all possible sample averages: mean, standard error, and distribution?
8. Estimation
   How do you infer about the population mean given the sample mean and the population standard error?
   How is the margin of error estimated if the standard error also has to be estimated?
9. Testing Hypothesis
   What are the new terms and definitions?
   How do you test a claim about a population parameter?
10. Review Questions

BASIC BUILDING BLOCKS OF MY STATISTICS 1 COURSE

1. We will use random sampling: every object in the population should have the same chance of being in your sample as any other object. When using the sample mean to estimate the population mean, this will eliminate bias and, in most cases, reduce error.
2. Sample estimates tend to be in error: e.g., sample mean - population mean ≠ 0.
3. In order to evaluate an error, compare it to the standard error:
      (sample mean - population mean) / standard error   (A)
   Note
   (a) The standard error consists of two components: a measure of variability and a measure of knowledge.
   (b) We evaluate the error using probability.
   (c) If the probability is low, either the sample was unlikely or one of the population values in the above ratio is not correct.
4. The margin of error (M.O.E.) is the largest error you would expect with a specified probability:
      -(M.O.E.) ≤ sample mean - population mean ≤ (M.O.E.)   (B)
   where the size of the margin of error depends on the probability.
   Note
   (a) When you can solve for the population mean in equation (B), the interval
      sample mean - (M.O.E.) ≤ population mean ≤ sample mean + (M.O.E.)   (C)
   will contain the population mean with a specified probability.
   (b) If the ratio of equation (A) falls between a negative and a positive value with a specified probability,
      -Value ≤ (sample mean - population mean) / standard error ≤ Value
   then the margin of error can be found by multiplying the standard error by the value.

For an introduction to a first-level statistics course, go to http://wweb.uta.edu/faculty/eakin/busa3321/IntroductionToCourse.doc

1. Definitions

Population: all the objects of interest: all cars, all households, all students
Sample: a portion of the objects of interest: some cars, some households, some students
Parameter: a number that describes some aspect of the population; e.g., the mean
Statistic: a number that describes some aspect of the sample

Example: A researcher is interested in determining information about net income (NI) of companies based on the type of company, the region (North or South), the amount of sales, and the amount of assets. Twenty companies are sampled. What objects are being collected? What would be the population and what would be the sample? What possible descriptions might be of interest?

2. Data

a. What types of data are available?
Quantitative: numeric values
Qualitative: values that fall into categories
Example: Using the previous example, which variables are quantitative and which are qualitative?

b. Data Collection (This is not a list of every possible type, just some of the most common.)
i. Convenience Samples. Data you have available. May or may not be random.
ii. Judgment Samples. Data chosen based on a person's decision about the correctness of collecting the observation. Usually not random.
iii. Random Samples (specifically a simple random sample). Every individual or item from the frame (a list) has an equal chance of being selected. Measurements are typically direct measurements.
iv. Surveys. Type of sample where the measurements are responses from individuals. Typically some people do not respond, which can bias the results. Individual responses vary from day to day.
v. Experiments. Similar objects are randomly placed into groups and a different treatment (drug, teaching method, work week, etc.) is applied to each one. The effect of the treatment is measured after the application. In many cases a cause-and-effect relationship can be established.
vi. Combinations of the above.

3. Graphs

How can data be graphed?
Qualitative Data: bar and pie charts
Quantitative Data: break the data into ranges and count the number in each range. Let each range be a bar of a bar chart; this chart is called a histogram.

Example: The net incomes of ninety companies (in millions) are measured. The following ranges, counts, and percentages were found:

Range in Millions   Count   Percent
10 up to 20         32      36%
20 up to 30         19      21%
30 up to 40         14      16%
40 up to 50         12      13%
50 up to 60          8       9%
60 up to 70          3       3%
70 up to 80          1       1%
80 up to 90          1       1%

[Figure: Percentage Distribution of Net Incomes. Histogram of percentage by income range in millions.]

How does the proportion of data in a range relate to probability?
If every object in the population has the same chance of being selected, then the percentage in a range is the probability of a value being in that range.
Example: What is the probability of finding a company whose net income falls in the range from 20 million to 50 million dollars? What type of sampling is needed for this?

4. How do you calculate population and sample averages?

Both population and sample averages are found by adding up all the values and dividing by the number of them.
Symbols: $\mu$ is the population mean and $\bar{X}$ is the sample mean.

5. For the population and sample, how do you calculate the typical distance a value is from its average?

Definition: The typical distance a value is from its average is called the Standard Deviation.

Calculation of Variance and Standard Deviation:
a. Calculate the average of the values.
b. Subtract the average from each value to see how far each value is from the average.
c. Square each difference.
d. Sum all the squared values.
e. To find the Variance:
   i. For the population, divide the sum by the number of values. (Symbol: $\sigma^2$)
   ii. For the sample, divide the sum by the number of values minus one. (Symbol: $s^2$)
f. To find the Standard Deviation, take the square root of the variance in e. (Symbol: $\sigma$ for the population standard deviation and $s$ for the sample standard deviation)

Both population and sample use steps a-d and f. The difference between them occurs at step e.

Example: Calculate the population and sample standard deviations for a set of five numbers.

Values: 6, 1, 3, 2, 2
Step a: mean = 2.8

Value   Step b. Distance to Average   Step c. Square the Distances
6       (6 - 2.8) = 3.2               10.24
1       (1 - 2.8) = -1.8               3.24
3       (3 - 2.8) = 0.2                0.04
2       (2 - 2.8) = -0.8               0.64
2       (2 - 2.8) = -0.8               0.64

Step d: Sum = 14.8
Step e: $\sigma^2$ = 14.8/5 = 2.96 and $s^2$ = 14.8/4 = 3.7
Step f: $\sigma$ = 1.720465053 and $s$ = 1.923538406

For more examples, ctrl-click on the following link. Press F9 for another example.
http://wweb.uta.edu/faculty/eakin/busa3321/calculating_variance_and_standard_deviation.xls
Suggested Exercise (Use Internet Explorer rather than Firefox): https://wweb.uta.edu/faculty/eakin/asps/Examples/varCalcQues.asp
Example of use: http://www.forbes.com/sport/2006/06/30/best-baseball-teams_cx_tvr_0705baseball.html

6. How do you determine the probabilities associated with the bell-shaped curve?

The empirical rule, an approximation to the bell-shaped curve: a histogram with ranges based on the mean and standard deviation, along with a specific set of percentages.
Range                                    Percent
$\mu - 3\sigma$ up to $\mu - 2\sigma$     2.5%
$\mu - 2\sigma$ up to $\mu - \sigma$     13.5%
$\mu - \sigma$ up to $\mu$               34.0%
$\mu$ up to $\mu + \sigma$               34.0%
$\mu + \sigma$ up to $\mu + 2\sigma$     13.5%
$\mu + 2\sigma$ up to $\mu + 3\sigma$     2.5%

Example: Suppose the ages of the buyers of a product were collected. The buyers had an average age of 30 with a typical deviation of 5. The ranges and percentages become:

Range         Percent
15 up to 20    2.5%
20 up to 25   13.5%
25 up to 30   34.0%
30 up to 35   34.0%
35 up to 40   13.5%
40 up to 45    2.5%

[Figure: Empirical Rule Example. Probability of being within each age interval, with intervals based on the mean and standard deviation.]

What is the probability that the next buyer will be between 20 and 35 years of age?

Other examples: Ctrl-click on the following link and press the F9 key for another example.
http://wweb.uta.edu/faculty/eakin/busa3321/empiricalrule_example.xls
Suggested Exercise (Use Internet Explorer rather than Firefox): https://wweb.uta.edu/faculty/eakin/asps/Examples/empiricalruleQues.asp

Bell-Shaped Curve: If more than six ranges are considered and the tops of the histogram bars are connected, a bell-shaped curve occurs. For an infinite number of intervals, the bell-shaped curve is also called the normal distribution.
Example of use: http://www.hardballtimes.com/main/article/face-forward-please/

The probabilities of values being within specific intervals have been tabled based on how far a value falls from the center in number of standard deviations. This is called the standard normal (or Z) table.

For examples on graphing regions of the normal distribution, double-click the embedded Excel file below. Change the values in red and scroll down to see the pictures of the probabilities. Click on the Excel tabs to see probabilities greater than, less than, or between two values.
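The table lookup described above can also be sketched in code. This is only an illustration, assuming Python's standard-library `statistics.NormalDist` as a stand-in for the printed Z table; it reuses the buyer-age example (mean 30, standard deviation 5), and the helper names `z_value` and `prob_between` are hypothetical, not part of the course materials.

```python
from statistics import NormalDist

def z_value(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def prob_between(low, high, mean, sd):
    """Probability that a normally distributed value falls between low and high."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(high) - dist.cdf(low)

# Buyer ages from the empirical rule example: mean 30, standard deviation 5.
z_low = z_value(20, 30, 5)    # 20 is two standard deviations below the mean
z_high = z_value(35, 30, 5)   # 35 is one standard deviation above the mean
p = prob_between(20, 35, 30, 5)

print("Z-values:", z_low, z_high)
print("Probability between 20 and 35:", round(p, 4))
```

For the same range the empirical rule gives 13.5% + 34.0% + 34.0% = 81.5%; the normal curve gives a value close to that (about 0.8186), which is what one would expect from an approximation.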
Values shown in the embedded file: mean 20, standard deviation 10, X = 18, Z-value = -0.2. The probability of finding an X value below 18 is 0.4207.
If the above Excel file does not work, you can find the file at: http://wweb.uta.edu/faculty/eakin/busa3321/graphingnormal.xls

7. Distribution of Sample Means

What are the characteristics of all possible sample averages: mean, standard error, and distribution?

If repeated samples of the same size are drawn from a very large population, the following results hold:
a. The average of all the sample averages will be the same as the average of the original population, since both use the same numbers.
b. From the introduction, the typical error (or standard error) in the sample average is a function of two items: variability and knowledge. The standard error is the population standard deviation divided by the square root of n. The square root is used because of diminishing returns of n. As an analogy, you typically learn more going from 1 to 2 years on the job than you learn from 28 to 29 years on the same job.
   Symbol: $\sigma_{\bar{X}} = \sigma/\sqrt{n}$ is the population standard error and $s_{\bar{X}} = s/\sqrt{n}$ is the sample estimate of the standard error.
c. The larger the sample size, the closer the distribution of a sample average is to a normal distribution. (If the original data is normal, then samples of any size will result in means that are normal.)

Example: Suppose you take all possible random samples of size 4 from the following population of size 6: {1, 2, 3, 4, 5, 6}.
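The enumeration in this example can be reproduced with a short script. This is a minimal sketch, assuming samples are drawn without replacement and that order does not matter (so each sample is a combination); the variable names are illustrative only.

```python
from collections import Counter
from itertools import combinations

population = [1, 2, 3, 4, 5, 6]

# Every possible sample of size 4, drawn without replacement.
samples = list(combinations(population, 4))

# The mean of each sample.
sample_means = [sum(s) / len(s) for s in samples]

# How often each sample mean occurs across all possible samples.
counts = Counter(sample_means)

population_mean = sum(population) / len(population)
mean_of_sample_means = sum(sample_means) / len(sample_means)

print("Number of possible samples:", len(samples))
print("Population mean:", population_mean)
print("Mean of all sample means:", mean_of_sample_means)
for value in sorted(counts):
    print(value, f"{counts[value] / len(samples):.0%}")
```

Running it lists 15 possible samples, and the mean of all the sample means equals the population mean of 3.5, which is result a. of this section.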
Average of the population is 3.5

Original Population
Value   Probability
1       16.7%
2       16.7%
3       16.7%
4       16.7%
5       16.7%
6       16.7%

Possible Samples   Sample Mean
{1, 2, 3, 4}       2.5
{1, 2, 3, 5}       2.75
{1, 2, 3, 6}       3
{1, 2, 4, 5}       3
{1, 2, 4, 6}       3.25
{1, 2, 5, 6}       3.5
{1, 3, 4, 5}       3.25
{1, 3, 4, 6}       3.5
{1, 3, 5, 6}       3.75
{1, 4, 5, 6}       4
{2, 3, 4, 5}       3.5
{2, 3, 4, 6}       3.75
{2, 3, 5, 6}       4
{2, 4, 5, 6}       4.25
{3, 4, 5, 6}       4.5

Sampling Distribution of Sample Means
Sample Mean   Probability
2.5            7%
2.75           7%
3             13%
3.25          13%
3.5           20%
3.75          13%
4             13%
4.25           7%
4.5            7%

[Figures: "Distribution of Original Data" (probability by value, 1 to 6) and "Distribution of All Possible Sample Means" (probability of the sample mean having each value, 2.5 to 4.5).]

What is the average of the original population? Average of all possible sample means?
What is the range of the original population? What is the range of all possible sample means?
What shape is the distribution of the original data? The sample means?

Finding probabilities of sample means: Change the value of the sample mean to a z-score and then use a table to look up the probability. For examples click on the following link: http://wweb.uta.edu/faculty/eakin/busa3321/graphingnormalmean.xls

8. Estimation: How do you infer about the population mean given the sample mean and the population standard error?

8.1 Estimation of population mean when the population variation is known.

Putting all the previous information together, we estimate the population mean to be the sample mean plus or minus some multiple of the standard error, where the multiple depends on the probability from a standard normal table. What we add and subtract is called the margin of error, and usually this is ignored in newspapers and business reports.
See http://www.businessweek.com/the_thread/hotproperty/archives/2007/01/why_we_ignore_t.html

Probability   Number of Standard Errors
80%           1.28
90%           1.645
95%           1.96
98%           2.33
99%           2.576

Example: Suppose from a random sample of size 49, we find a sample mean of 30. It is known that the typical distance a value is from the population mean (the standard deviation) is 35. What is the population mean with 95% confidence?

Solution:
Identifier: "What is (or estimate) the population mean?"
First calculate the typical error in a sample mean. This value is 35 divided by the square root of 49 = 5. Therefore, when using this sample mean, the typical error you would expect is five.
Next determine how far you have to go either side of the sample mean for the specified confidence. With 95% confidence you have to go 1.96 standard errors (1.96*5 = 9.8) either side of the sample mean to have 95% confidence that the population mean is within the interval.
With 95% confidence we can say that the population mean is 30 with a maximum possible error of 9.8.

For other examples, ctrl-click on the following link. Press the F9 key for other examples: http://wweb.uta.edu/faculty/eakin/busa3321/zconint.xls
Suggested Exercise (Use Internet Explorer rather than Firefox): https://wweb.uta.edu/faculty/eakin/asps/Examples/ZConfIntForMuQuesGrad.asp
If you want to work more than one of the above exercises, then after completing one exercise use the Back command in the Internet Explorer browser and refresh the first screen.

8.2 Estimation of the population proportion, p, a special case of a population mean

8.2.1 Background:
Consider a population of size 5 where there are 3 successes and 2 failures. The probability of a success in the population, p, equals 3/5 = 0.60. Consider recording the five values where successes are recorded as 1's and failures are recorded as 0's. Find the variance of this list of 0's and 1's using the rules from section 5:

Values   b. Distance to Mean   c. Squared Distance
1        1 - 0.60 = 0.40       (0.40)^2 = 0.16
1        1 - 0.60 = 0.40       (0.40)^2 = 0.16
1        1 - 0.60 = 0.40       (0.40)^2 = 0.16
0        0 - 0.60 = -0.60      (0.60)^2 = 0.36
0        0 - 0.60 = -0.60      (0.60)^2 = 0.36

a. $\mu$ = 3/5 = 0.60
d. Sum = 1.20
e. $\sigma^2$ = 1.20/5 = 0.24 (divide by 5 since it's a population)

Note: From a. we see the population proportion is a population mean, and from e. that the population variance is 0.60*0.40 = p(1-p).

Thus when estimating the population proportion, p, the sample proportion, $\hat{p}$, becomes a special case of a sample mean and we can use the rules of section 7 with $\sigma^2$ replaced by p(1-p) and with the word "mean" replaced with "proportion".
[Note: in other textbooks the notation changes, where $\pi$ denotes the population proportion and p denotes the sample proportion.]

What are the characteristics of all possible sample proportions: mean, standard error, and distribution?

If repeated samples of the same size are drawn from a very large population, the following results hold:
a. The average of all the sample proportions will be the same as the proportion of the original population that are successes, since both use the same numbers.
b. From the introduction, the typical error (or standard error) in the sample proportion is a function of two items: variability and knowledge. The standard error is the population standard deviation divided by the square root of n.
   $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$ is the population standard error and $s_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})/n}$ is the sample standard error (or the estimate of the population standard error).
c. The larger the sample size, the closer the distribution of a sample proportion is to a normal distribution. (A sample size is large enough if both np and n(1-p) are greater than or equal to 5. In the case where p is unknown, a sample size is large enough if you have at least 5 successes and 5 failures in the sample.)

8.2.2 Estimation of population proportion, p

Use the same rules as a confidence interval for a population mean with the word "mean" replaced with the word "proportion".
Solution Steps:
Identifier: "What is (or estimate) the population proportion?"
First calculate the standard error in a sample proportion. Since the population proportion is not known, we can only use the sample standard error.
Next determine how far you have to go either side of the sample proportion for the specified confidence. This is called the margin of error. For example, with 95% confidence you have to go 1.96 standard errors either side of the sample proportion to have 95% confidence that the population proportion is within the interval.
Next make your conclusion. With a specified confidence we can say that the population proportion is the sample proportion plus or minus its margin of error.

Example: With 90% confidence, estimate the population proportion of all students who would understand this lecture, if you had observed a random sample of 50 students and found that 20% understand it.

Solution Steps:
Identifier: "What is (or estimate) the population proportion?"
First calculate the standard error in a sample proportion. Since the population proportion is not known, we can only use the sample standard error. The sample standard error is the square root of [0.20 * (1 - 0.20) / 50] = 0.056569.
Next determine how far you have to go either side of the sample proportion for the specified confidence. This is called the margin of error. In this case, the margin of error is 1.645*0.056569 = 0.093055.
Next make your conclusion. We estimate that the proportion of all students who would understand this lecture is 20%. With 90% confidence this estimate is off by no more than plus or minus 9.3%.
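Because the proportion interval follows the same rule as the interval for a mean in section 8.1, both worked examples can be retraced with one small helper. This is a sketch under stated assumptions: the function names `margin_of_error` and `interval` are hypothetical, and the multipliers 1.96 and 1.645 are simply copied from the table of standard errors in section 8.1 rather than computed.

```python
from math import sqrt

def margin_of_error(multiplier, standard_error):
    """Margin of error = number of standard errors times the standard error."""
    return multiplier * standard_error

def interval(estimate, moe):
    """Interval: estimate minus and plus the margin of error."""
    return (estimate - moe, estimate + moe)

# Section 8.1 example: n = 49, sample mean 30, population standard deviation 35,
# 95% confidence (multiplier 1.96 from the table).
se_mean = 35 / sqrt(49)
moe_mean = margin_of_error(1.96, se_mean)
print("Mean:", interval(30, moe_mean))

# Section 8.2.2 example: n = 50, sample proportion 0.20,
# 90% confidence (multiplier 1.645 from the table).
p_hat, n = 0.20, 50
se_prop = sqrt(p_hat * (1 - p_hat) / n)   # sample standard error of the proportion
moe_prop = margin_of_error(1.645, se_prop)
print("Proportion:", interval(p_hat, moe_prop))
```

The two margins of error come out at about 9.8 and 0.093, matching the worked solutions. Only the standard error calculation differs between the two cases; the multiplier step is identical.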
More examples: https://wweb.uta.edu/faculty/eakin/busa3321/ZConIntforP.xls
Suggested Exercise (Use Internet Explorer rather than Firefox): https://wweb.uta.edu/faculty/eakin/asps/Examples/ZConfIntforPIQuesGrad.asp
If you want to work more than one of the above exercises, then after completing one exercise use the Back command in the Internet Explorer browser and refresh the first screen.

8.3 Estimation of population mean if the population is normal but the population standard error is unknown

The standard normal table, given a probability, determines the number of standard errors a sample mean is from the population mean. If the standard error is not known, we use the sample estimate of it (shown above) and we must change to a table that determines the number of estimated standard errors a sample mean is from its population mean for a given probability. This is the t-table: http://wweb.uta.edu/faculty/eakin/busa3321/alternativettable.doc

There are three column headings. The second set, labeled "Within", is used with confidence intervals. Example: for 98% confidence, go to the column heading labeled "Within" and find the 0.98 column. The rows correspond to the degrees of freedom, which is n-1 for the sample mean.

Example: We wish to estimate the population mean with 90% confidence based on a sample of size 20. Using the t-table, we would go to row 19 and the "Within" column for 0.90 (0.05 in each tail). You would have to go 1.7291 sample standard errors either side of the sample mean to have 90% confidence that the population mean is in the interval.

Another example: You wish to estimate the average number of housing starts in all large cities in the United States. You have a random sample of 25 cities and obtain the number of housing starts in each. The sample mean is 525 with a sample standard deviation of 40.

Solution:
Identifier: "What is (or estimate) the population mean?"
First calculate the typical error in a sample mean. This value is 40 divided by the square root of 25 = 8.
Therefore, when using this sample mean, the typical error you would expect is estimated to be eight.
Next determine how far you have to go either side of the sample mean for the specified confidence. With 95% confidence and 24 degrees of freedom, you have to go 2.0639 standard errors (2.0639*8 = 16.5112) either side of the sample mean to have 95% confidence that the population mean is within the interval.
With 95% confidence we can say that the average number of housing starts in all cities of interest is 525 with a maximum possible error of 16.5112.

Requirements to use a t-table:
The original population must be normal, or the sample must be very large.
The sample must be a simple random sample.

Another Example: You are measuring the size of houses in a city in thousands of square feet. From a random sample of 225 houses, you find a sample mean and standard deviation of 2.0 and 0.2 respectively. With 90% confidence, what is the estimate of the average house size of all houses in the city and what is the estimate's margin of error? Give the conclusion as if you were talking to someone who has not had statistics.
Go to the following link for the solution. Notice how you solve the problem by putting the example side-by-side with a previously solved exercise: http://wweb.uta.edu/faculty/eakin/busa3321/ReviewStat1tCIexample.doc

For other examples, ctrl-click the link below. Press F9 to see another example.
http://wweb.uta.edu/faculty/eakin/busa3321/tconInt.xls
Suggested Exercise (Use Internet Explorer rather than Firefox): https://wweb.uta.edu/faculty/eakin/asps/Examples/tConfIntForMuQues.asp

Use of estimation in business: http://www.imcstlouis.org/artman/publish/article_20.shtml
Use in baseball: http://www.baseballmusings.com/archives/009154.php
Basketball (search for "confidence interval"): http://www.basketballprospectus.com/unfiltered/index.php?s=carmelo
(Ab)use in newspapers: http://www.nytimes.com/2007/04/08/opinion/08pubed.html

8.4 Notes on when to use Z or t (In all cases you must have a random sample.)

8.4.1 Use the Z distribution when

a. you are conducting a confidence interval or hypothesis test on a population mean with the population standard deviation known AND the sample means are approximately normally distributed:
   The sample means are normally distributed for any sample size if the original population is normally distributed.
   The sample means are approximately normally distributed when the sample size is at least 5 and the original population is approximately symmetric.
   If you do not know the shape of the distribution for the original population, a sample size of at least 30 will give sample means that are approximately normally distributed for most populations.

b. you are conducting a confidence interval or hypothesis test on a population proportion AND the sample proportions are approximately normally distributed:
   The sample proportion is approximately normally distributed if the sample size is large enough that both np and n(1-p) are greater than or equal to 5. In the case where p is unknown, a sample size is large enough if you have at least 5 successes and 5 failures in the sample.
8.4.2 Use the t distribution when you are conducting a confidence interval or hypothesis test on a population mean with the population standard deviation unknown AND:
   The sample is large (the t and z become almost the same value in that case) OR
   The original population is normally distributed.

9. Testing Hypothesis

What are the new terms and definitions?

Null hypothesis: the status quo, a given value of a population parameter, something you wish to reject
Alternative hypothesis: the opposite of the null and something you wish to support

Example: In the past, the average paper length in the process has been 11 inches. You wish to detect a problem with the process if it occurs. (The population mean is no longer 11 inches.)
Null hypothesis: The population average is still 11 inches.
Alternative: The population average is no longer 11 inches.

Types of Errors
Type 1. Rejecting the null when it is true.
Type 2. Not rejecting the null when it is false.

Example: Using the above example
Type 1. Saying the process is out of control when it isn't.
Type 2. Saying the process is in control when it is actually out of control.

Probabilities
Probability of a type 1 error is $\alpha$, also called the level of significance.
Probability of a type 2 error is $\beta$.
Probability of rejecting the null when it is false is $1-\beta$, also called the power of the test.
Probability of finding your sample estimate (or something more extreme) if the null was true is called the p-value. Small p-values imply that you are either very unlucky or the null is false.

Test Statistic: a sample calculation used to test the null hypothesis. For sample means, this is the number of sample standard errors your sample mean is from the hypothesized value. (You know the sample mean will not equal the population mean, so you compare the observed difference with what should be typical.)

Rejection Region: values of the test statistic that would be unlikely if the null was true.

How do you test a claim about a population parameter?
Two approaches that give the same result:
Rejection Region approach: Based on the level of significance, decide which values of the test statistic are unlikely if the null was true. Reject the null if the sample test statistic falls into this range.
p-value approach: Determine the likelihood of obtaining your sample data (or something more extreme) if the null was true. If this probability is less than the significance level, reject the null.

Using p-values in baseball: http://www.hardballtimes.com/main/article/breaking-the-pitcher
Using p-values in business: http://denver.bizjournals.com/denver/prnewswire/press_releases/national/Australia/2007/08/28/HKTU007

Example: The average paper length in a manufacturing process has been 11 inches in the past. You think the process is producing paper that is too short. You take a random sample of 36 sheets and determine that the sample average is 10.98 inches and the sample standard deviation is 0.06 inches. At the five percent level of significance, can you say the process is out of control?

Solution:
Identifier: Does the population mean take on a specific set of values? In this case, is the average paper length less than 11?
Determine the null and alternative. (What you wish to show goes in the alternative and the equal sign goes in the null.)
The null is that the average paper length is 11 inches.
The alternative is that the average paper length is less than 11 inches.
Determine the values of the sample mean that would be unlikely if the average paper length was 11 and would support the alternative. In this case small sample means would cause you to support a small population mean and lead you to reject the null and support the alternative. Using the t-table with 35 degrees of freedom, a sample mean that is 1.6896 or more sample standard errors below the hypothesized mean would occur only five percent of the time. (Only one side is considered, so the alpha is not divided by two.)
Determine how far your data is below the hypothesized value.
In this case the sample standard error is 0.01 (0.06 divided by the square root of 36). Your sample mean then falls two standard errors below 11 inches. This is an unlikely number and would cause you to say that the null is false and support the alternative.

Recapping:
Null: $\mu = 11$
Alternative: $\mu < 11$
Rejection Region: Reject Ho if t < -1.6896
Test Statistic: t = (10.98 - 11) / 0.01 = -2 (two sample standard errors below)
Decision: Reject the null and support the alternative.
Conclusion: We can say the process is out of control and producing paper that is too short on average.

Notes
1. If your test statistic does not fall in the rejection region, all you can conclude is that it is possible that the population mean could still be 11. (It is possible that the process is producing paper that is too short but our test could not detect it.)
2. If you have a two-sided alternative (the mean is not 11 inches), your rejection region uses the "Outside" column heading in the t-table.

For other examples, ctrl-click the link below. Press F9 to see another example.
http://wweb.uta.edu/faculty/eakin/busa3321/thyptest.xls
YOU MUST WORK ONE OF EACH OF THE THREE TYPES: (A) A TWO-SIDED TEST, (B) A LEFT-SIDED TEST, AND (C) A RIGHT-SIDED TEST.

Suggested Exercise: A real estate agent claims that the average house size in a city is 2,500 square feet. You take a random sample of 225 houses in that city, measuring the size of the houses in square feet. You find a sample mean and standard deviation of 2,400 and 200 respectively. At a 5% level of significance, can you conclude that the real estate agent is incorrect? Give the conclusion as if you were talking to someone who has not had statistics.

Examples of use in baseball: http://www.insidethebook.com/ee/index.php/site/comments/another_nail_in_the_hot_hand_coffin/
Examples in business: http://en.wikipedia.org/wiki/Six_Sigma

10. Review Questions (moved to Blackboard)
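The rejection region approach from the paper-length example in section 9 can be retraced in a few lines. This is a minimal sketch, not a general testing routine: the function name `t_statistic` is hypothetical, and the critical value -1.6896 is copied from the text (t-table, 35 degrees of freedom, five percent in one tail) rather than computed.

```python
from math import sqrt

def t_statistic(sample_mean, hypothesized_mean, sample_sd, n):
    """Number of sample standard errors the sample mean is from the hypothesized mean."""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Paper-length example: null mean 11, sample of 36 sheets,
# sample mean 10.98, sample standard deviation 0.06.
t = t_statistic(10.98, 11, 0.06, 36)

# Left-sided test at the 5% level; critical value quoted from the t-table in the text.
critical_value = -1.6896
reject_null = t < critical_value

print("Test statistic:", round(t, 4))
print("Reject the null:", reject_null)
```

The test statistic is about -2, which lies in the rejection region, so the sketch reaches the same decision as the recap: reject the null and support the alternative.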