AMS 5 REVIEW

Variables
• Quantitative (numerical scale)
  • Discrete (e.g. size)
  • Continuous (e.g. age)
• Qualitative (e.g. marital status)

Average and Standard Deviation
The median of a histogram is the value with half the area to the left and half to the right. In a symmetric histogram the median and the average coincide.

Design of Experiments
To eliminate bias, subjects are assigned to each group at random and the experiment is run double blind. This is called a controlled experiment, and it allows the investigator to establish a causal effect of the treatment on the response.

Sample Surveys
A population is a class of individuals that an investigator is interested in. A full examination of a population requires a census. If only one part of the population is examined, then we are looking at a sample. There are usually some numerical characteristics of the population that we are interested in. These are called parameters. Parameters are unknown quantities; they are estimated using statistics, which are numbers that can be computed from the sample. When considering the quality of a survey, keep in mind two possible sources of bias: selection bias and non-response bias.

Probability
How do we quantify chance?

Notation
• For an event A, the probability of A is denoted P(A).
• For two events A and B, the conditional probability of A given B is denoted P(A|B).
• The multiplication rule can be written as P(A and B) = P(A|B) P(B) = P(B|A) P(A).
• A and B are independent if P(A|B) = P(A) and P(B|A) = P(B).
• When two events are independent, the multiplication rule becomes P(A and B) = P(A) P(B).

Addition Rule
In mathematical notation, the addition rule is P(A or B) = P(A) + P(B) - P(A and B).

Expected Value and Standard Error
• About 68% of the draws will be within one standard unit of the expected value.
• About 95% of the draws will be within two standard units of the expected value.
• About 99% of the draws will be within 2.5 standard units of the expected value.
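The multiplication and addition rules above can be checked by exact enumeration. The following is a minimal sketch, not part of the original review: the two-dice sample space, the events A ("first die shows a 6") and B ("second die is even"), and the helper names `prob` and `cond_prob` are all illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 equally likely outcomes of rolling two fair dice.
# (The dice and the events below are illustrative assumptions.)
OUTCOMES = list(product(range(1, 7), repeat=2))


def prob(event):
    """Exact probability of an event, given as a predicate on one outcome."""
    favorable = sum(1 for outcome in OUTCOMES if event(outcome))
    return Fraction(favorable, len(OUTCOMES))


def cond_prob(event, given):
    """P(event | given), computed by restricting the sample space to `given`."""
    restricted = [outcome for outcome in OUTCOMES if given(outcome)]
    favorable = sum(1 for outcome in restricted if event(outcome))
    return Fraction(favorable, len(restricted))


def first_is_six(outcome):   # event A
    return outcome[0] == 6


def second_is_even(outcome):  # event B
    return outcome[1] % 2 == 0


p_a = prob(first_is_six)
p_b = prob(second_is_even)
p_a_and_b = prob(lambda o: first_is_six(o) and second_is_even(o))
p_a_or_b = prob(lambda o: first_is_six(o) or second_is_even(o))
p_a_given_b = cond_prob(first_is_six, second_is_even)

# Multiplication rule: P(A and B) = P(A|B) P(B)
multiplication_holds = p_a_and_b == p_a_given_b * p_b
# Independence: P(A|B) = P(A)
independent = p_a_given_b == p_a
# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
addition_holds = p_a_or_b == p_a + p_b - p_a_and_b

print(p_a_and_b, p_a_or_b, multiplication_holds, independent, addition_holds)
```

Using `Fraction` keeps every probability exact, so the rules can be compared with `==` rather than with a floating-point tolerance.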
Problem 1: The figure below is a histogram for the scores on the final of a certain class.
1. What percentage of the students scored between 20 and 40 points?
The boxes in the histogram correspond, from left to right, to 10%, 10%, 10%, 20%, 25%, 12.5% and 12.5% of the scores. From 20 to 40 there are 30% of the scores.
2. What is the median score of the class?
The median is 40.

Problem 2: Which of the following statements is true and which is false?
1. If two events are independent then they are mutually exclusive. F
2. If A and B are two events, then according to the multiplication rule, the probability that both A and B happen equals the probability of A times the probability of B. F
3. The ages of 10 freshmen and two professors are recorded; then the average age of the group is larger than the median age. T
4. One kilogram is approximately equal to 2.2 pounds; this implies that the standard deviation of the amount of fish consumed in a restaurant per day is larger if measured in kilos than if measured in pounds.
Amount of fish in pounds = 2.2 × (amount of fish in kilos). Then SD(pounds) = 2.2 × SD(kilos). So the SD of the amount in kilos is 1/2.2 times the SD of the amount in pounds, which is smaller. The answer is FALSE. The important part of this question is that the SD changes when the units are changed.
5. 100 tickets are drawn at random with replacement from the box containing [box contents not reproduced] and you win the sum of the tickets. The same game is repeated using the box [box contents not reproduced]. You expect to win the same amount in both cases. T
6. A high non-response rate is a serious problem for survey organizations because the investigators have to spend more time and money getting additional people to bring the sample back to its planned size. F

Problem 3: A large number of people get together. Each person rolls a die 180 times and counts the number of 1's. About what percentage of these people should get counts in the range 20 to 40?
The expected number of 1's is 30 and the standard error is 5, so the interval from 20 to 40 extends two standard errors on either side of the expected value and thus contains approximately 95% of the counts. So we expect that approximately 95% of the people will get counts in that range.

Problem 4: Two students A and B are both registered for a certain course. Assume that student A attends class 80% of the time and student B 60% of the time, and the absences of the two students are independent.
1. What is the probability that both students will be in class on a given day?
Since the events are independent, the multiplication rule can be used with the unconditional probabilities. This yields 0.8 × 0.6 = 48%.
2. What is the probability that at least one of the two students will be in class on a given day?
We can either use the addition rule, subtracting the probability calculated in the previous question (0.8 + 0.6 - 0.48), or think of the opposite event and use the multiplication rule (1 - 0.2 × 0.4). The answer is 92%.

Problem 5: Suppose the election of the president of the student union is conducted using the following system: students vote in the college they belong to, and the candidate that wins the highest number of colleges is elected. A candidate wins a college if he or she obtains the majority of the votes in the college. Is it possible that the winner will not have a majority of the total vote? If yes, give an example; if no, explain.
This is related to Simpson's paradox. It is possible that the winning candidate will not have the majority of the student vote. One situation where this may happen is the following: the candidate who wins more than half of the colleges does so by winning with a small margin in colleges where the number of voting students is relatively small. At the same time he or she loses by a large margin in the colleges where the number of voting students is relatively large.
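The arithmetic in Problems 3 and 4 can be reproduced with a short sketch. It assumes the 0-1 box model for counting 1's (SD of the box = square root of the product of the two fractions) and uses the normal curve without a continuity correction; the function names are illustrative and not from the original.

```python
import math


def count_ev_and_se(n_draws, fraction_of_ones):
    """Expected value and SE for the number of 1s in draws from a 0-1 box."""
    sd_box = math.sqrt(fraction_of_ones * (1.0 - fraction_of_ones))
    expected = n_draws * fraction_of_ones
    standard_error = math.sqrt(n_draws) * sd_box
    return expected, standard_error


def normal_area_between(z_low, z_high):
    """Area under the standard normal curve between z_low and z_high."""
    def cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return cdf(z_high) - cdf(z_low)


# Problem 3: 180 rolls of a die, counting the 1s.
expected_count, se_count = count_ev_and_se(180, 1.0 / 6.0)
z_low = (20 - expected_count) / se_count
z_high = (40 - expected_count) / se_count
share_in_range = normal_area_between(z_low, z_high)

# Problem 4: independent attendance of students A and B.
p_a, p_b = 0.8, 0.6
p_both = p_a * p_b                                   # multiplication rule
p_at_least_one = 1.0 - (1.0 - p_a) * (1.0 - p_b)     # opposite event

print(expected_count, se_count, share_in_range, p_both, p_at_least_one)
```

The range 20 to 40 corresponds to two standard errors on either side of the expected count, which is why the normal area comes out close to 95%.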
Normal Density
The Gaussian or normal curve corresponds to the following formula:

$y = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad e \approx 2.71828$

The area below the curve is equal to one. We observe that the curve is symmetric around zero and that most of the area is concentrated between -4 and 4. The probability of an interval is the corresponding area under the curve.

Doing calculations with the normal curve requires the use of a table. Tables are available for the standard normal curve, and they require that observations be transformed to standard units. With the help of the tables of the book we can find P((-z, z)). From it, for positive z and x:
• P((0, z)) = P((-z, 0)) = 1/2 × P((-z, z))
• P((-z, x)) = P((-z, 0)) + P((0, x))
• P(< -z) + P(> z) = 1 - P((-z, z))
• P(> z) = 1/2 × [P(< -z) + P(> z)] = 1/2 × [1 - P((-z, z))]
• P(< z) = P(< 0) + P((0, z)) = 1/2 + P((0, z))
• P((z, x)) = 1/2 × [P((-x, x)) - P((-z, z))], for z < x

Percentages
A box contains tickets with 0s and 1s.
The SD of the box is given by $\sqrt{(\text{fraction of 1s}) \times (\text{fraction of 0s})}$.
The SE for the sum of 1s is $\sqrt{\text{number of draws}} \times \text{SD}$.
The SE for the percentage of 1s is $\frac{\text{SD}}{\sqrt{\text{number of draws}}}$.
• Sample percentage ± 1 SE is an approximate 68% confidence interval for the percentage.
• Sample percentage ± 2 SE is an approximate 95% confidence interval for the percentage.
• Sample percentage ± 3 SE is an approximate 99.7% confidence interval for the percentage.

Estimating Averages
The normal approximation can be used to create confidence intervals for the average. Remember that a confidence interval of, say, 95% means that if the experiment is repeated 100 times, about 95 of the resulting intervals will contain the true value of the average.

Standard Errors

Test of Significance
• Set up the null hypothesis.
• Pick a test statistic to measure the difference between the data and what is expected under the null hypothesis.
• Compute the test statistic and the corresponding observed significance level.
In general we calculate a test statistic given by

$z = \frac{\text{observed} - \text{expected}}{\text{SE}}$

which is referred to as the z-test.
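As a minimal sketch of how the percentage formulas and the z-test fit together, the code below computes an approximate 95% confidence interval and a one-sided observed significance level. The sample (1,000 draws with 54% 1s) and the null value of 50% are hypothetical numbers chosen for illustration, and `normal_cdf` and `percentage_se` are assumed helper names; the normal area is computed with `math.erf` instead of a table.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percentage_se(fraction_of_ones, n_draws):
    """SE for the fraction of 1s: SD of the 0-1 box over sqrt(number of draws)."""
    sd_box = math.sqrt(fraction_of_ones * (1.0 - fraction_of_ones))
    return sd_box / math.sqrt(n_draws)


# Hypothetical data (not from the review): 1,000 draws, 54% of them 1s.
n_draws = 1000
sample_fraction = 0.54
null_fraction = 0.50  # hypothetical null hypothesis: the box has 50% 1s

# Approximate 95% confidence interval: sample percentage +/- 2 SE,
# with the SD of the box estimated from the sample.
se_sample = percentage_se(sample_fraction, n_draws)
ci_95 = (sample_fraction - 2 * se_sample, sample_fraction + 2 * se_sample)

# z-test: (observed - expected) / SE, with the SE computed under the null.
se_null = percentage_se(null_fraction, n_draws)
z = (sample_fraction - null_fraction) / se_null
p_value = 1.0 - normal_cdf(z)  # one-sided: area to the right of z

print(ci_95, z, p_value)
```

Note the design choice: the confidence interval uses the sample fraction to estimate the SD of the box, while the test uses the fraction stated by the null hypothesis, since the test statistic measures the data against what is expected under the null.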
The smaller the P-value, the stronger the evidence against the null, but ...

The t-test
Step 1: Consider a different estimate of the SD:

$\mathrm{SD}^+ = \sqrt{\frac{\text{number of measurements}}{\text{number of measurements} - 1}} \times \mathrm{SD}$

Notice that $\mathrm{SD}^+ > \mathrm{SD}$.
Step 2: $t = \frac{\text{observed} - \text{expected}}{\mathrm{SE}^+}$, where $\mathrm{SE}^+$ corresponds to $\mathrm{SD}^+$.
Step 3: To find the observed significance level we cannot use the normal curve any more. We need to use a Student's t curve. This curve depends on the degrees of freedom (DF). These are calculated as
degrees of freedom = number of measurements - 1

Test of Difference
H0: the difference is 0
H1: the average of one group is bigger than that of the other

Two-Tailed Test
H0: the difference is 0
H1: the difference is not 0 (the averages of the two groups differ)
When a two-tailed test is used, the P-value is calculated by adding the areas that correspond to both tails of the normal curve.

Correlation
• The correlation is not affected when the two variables are interchanged.
• The correlation is not changed if the same number is added to all the values of one of the variables.
• The correlation is not changed if all the values of one of the variables are multiplied by the same positive number. It will change sign if the number is negative.
• The correlation coefficient is 1 if the variables have perfect positive linear association and -1 if they have perfect negative linear association.

Regression
Associated with an increase of one SD in x there is an increase of r SDs in y, on average.
error = actual value of y - predicted value of y
$\text{RMS error} = \sqrt{1 - r^2} \times \text{SD of } y$

Problem 6: The speed of light is measured 25 times by a new procedure. The 25 measurements are recorded and show no trend or pattern. The average of the measurements is 299,789.2 kilometers per second and the SD is 12 kilometers per second. Find an approximate 95% confidence interval for the speed of light.
1. Calculate the SE of the average.
The SE is given by $12 / \sqrt{25} = 2.4$ km per second.
2. Find an approximate 95% confidence interval for the speed of light.
Two SEs correspond to 4.8 km per second. Thus a 95% confidence interval is given by (299,784.4, 299,794.0).

Problem 7: A simple random sample of size 400 was taken from the population of all manufacturing establishments in a certain state. The results are that 16 establishments had 250 employees or more.
1. Estimate the percentage of manufacturing establishments with 250 employees or more.
16/400 = 4%
2. Attach a standard error to the estimate.
$\frac{\sqrt{0.04 \times 0.96}}{\sqrt{400}} \approx 0.01$, that is, about 1 percentage point.

Problem 8: Find the area under a Student's t curve with 3 degrees of freedom in the following cases:
1. To the right of 2.35. 5%
2. To the left of -2.35. 5%
3. Between -2.35 and 2.35. 90%
4. Are these values higher or lower than the ones that correspond to the standard normal curve?
The areas in 1 and 2 are smaller for the normal curve; as a consequence, the area in 3 is larger.

Problem 9: Looking at data and making sense of them is the first step of a statistical analysis. The scatter diagram below shows the ages of 1,000 husbands and wives in a town in California. Explore the plot. Is there anything wrong with the data?
The range of x does not correspond to the usual range of ages of married men. In particular, there is a 5-year-old man married to a 20-year-old woman.

Problem 10: True or false:
1. To make a t-test with 4 measurements, use a Student's t curve with 4 degrees of freedom. F
2. For a given experiment the null hypothesis is that the average is equal to 231 units. The alternative hypothesis is that the average is above 231 units. You compute a z-test and the corresponding P-value is 2.5%. The conclusion is that the probability that the average is equal to 231 units is 2.5%. F
3. The R.M.S. error for a regression line of y on x is less than or equal to the SD of y. T
4. The correlation between the daily minimum temperatures of L.A. and San Francisco is higher when measured in Fahrenheit than when it is measured in Celsius. F
5. The correlation between two variables is -0.92; this implies that there is a strong negative linear association between the variables. T

Problem 11: Freshmen at public universities work 12.2 hours a week for pay, on average, and the SD is 10.5 hours; at private universities the average is 9.2 hours and the SD is 9.9 hours. Assume the data are based on two independent samples, each of size 1,000. Is the difference due to chance?
1. Formulate the null and the alternative hypothesis.
H0: There is no difference between public and private universities.
H1: Students at public universities work longer hours than those at private universities.
2. Calculate the SE for the difference of the averages.
$\mathrm{SE}_{\text{public}} = \frac{10.5}{\sqrt{1000}} \approx 0.33$
$\mathrm{SE}_{\text{private}} = \frac{9.9}{\sqrt{1000}} \approx 0.31$
Then the SE of the difference is $\sqrt{0.33^2 + 0.31^2} \approx 0.45$.
3. Calculate the appropriate test statistic.
$z = \frac{12.2 - 9.2}{0.45} \approx 6.7$
4. What is your conclusion?
The null hypothesis is rejected, since the P-value is very close to 0.

Problem 12: A statistical analysis is made of the midterm and final scores in a large class. The results are
average midterm score ≈ 60, SD ≈ 15
average final score ≈ 65, SD ≈ 20
r ≈ 0.50
1. Using the normal approximation, about what percentage of the students scored over 80 on the midterm?
80 points on the midterm corresponds to $\frac{80 - 60}{15} = 1.33$ standard units. Using the normal curve we obtain that approximately 9% of the students scored over 80 on the midterm.
2. What is the R.M.S. error?
$\sqrt{1 - 0.5^2} \times 20 = 17.32$
3. What is the slope of the regression line?
$\frac{0.5 \times 20}{15} = 0.67$
4. What is the predicted final score for a student who scored 80 on the midterm?
80 points on the midterm is 1.33 SDs above average. This corresponds to 1.33 × 0.5 = 0.67 SDs above average on the final, that is, 0.67 × 20 = 13.4 points over the average on the final. So the students who scored 80 on the midterm scored, on average, 65 + 13.4 = 78.4 on the final.
5. Of the students who scored 80 on the midterm, about what percentage scored over 80 on the final?
In standard units we have $\frac{80 - 78.4}{17.32} = 0.09$, and there is an area of about 46% to the right of this value under the normal curve.
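The calculations in Problems 11 and 12 can be retraced with a short sketch. It is only a check of the arithmetic under the same normal approximation, with `math.erf` standing in for the table; the variable and function names are illustrative. Because the code does not round intermediate values, it gives a test statistic of about 6.6 for Problem 11 and a predicted final score of about 78.3 for Problem 12, slightly different from the rounded figures in the solutions.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Problem 11: two independent samples of 1,000 students each.
se_public = 10.5 / math.sqrt(1000)
se_private = 9.9 / math.sqrt(1000)
se_difference = math.sqrt(se_public ** 2 + se_private ** 2)
z_difference = (12.2 - 9.2) / se_difference

# Problem 12: summary statistics for midterm (x) and final (y).
avg_mid, sd_mid = 60.0, 15.0
avg_final, sd_final = 65.0, 20.0
r = 0.5

# Part 1: share of students over 80 on the midterm.
share_mid_over_80 = 1.0 - normal_cdf((80 - avg_mid) / sd_mid)
# Part 2: RMS error of the regression line of final on midterm.
rms_error = math.sqrt(1.0 - r ** 2) * sd_final
# Part 3: slope = r * SD of y / SD of x.
slope = r * sd_final / sd_mid
# Part 4: predicted final score for a midterm score of 80.
predicted_final = avg_final + slope * (80 - avg_mid)
# Part 5: among students with 80 on the midterm, share over 80 on the final,
# using the predicted score as the new average and the RMS error as the new SD.
share_final_over_80 = 1.0 - normal_cdf((80 - predicted_final) / rms_error)

print(z_difference, share_mid_over_80, rms_error, slope,
      predicted_final, share_final_over_80)
```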