Focus Questions Chapters 1-5

Directions: Answer all questions asked and show work and formulas used.

1. Suppose you have the following sequence of numbers: 2, 7, 18, 11, 8, 9, 15, 21, 25, and 32.
a. Calculate the mean of the data set.
b. Calculate Q1, Q2, and Q3 and create a 5-number summary. Make sure to include the IQR and identify whether there are any outliers.
c. Calculate the variance and standard deviation.
d. Create a stem and leaf display of the data.

(a) Recall that to calculate the mean we sum the values and divide by N. In this case we have 10 data points, so N = 10.

μ = ∑xi / N = (2 + 7 + 18 + 11 + 8 + 9 + 15 + 21 + 25 + 32) / 10 = 148 / 10 = 14.8

(b) Now we want to calculate the three quartiles. The first thing we need to do is put the data in order from lowest to highest:

2, 7, 8, 9, 11, 15, 18, 21, 25, 32

To calculate the 25th percentile (and any other percentile, for that matter) we use the index formula i = (p/100)n.

Q1: i = (25/100)10 = 2.5. Since we have a decimal we round up to 3, so the 3rd data point is the 25th percentile: Q1 = 8.
Q2: i = (50/100)10 = 5. Since we have an integer we average the ith and (i+1)th values, in this case the 5th and 6th values: Q2 = (11 + 15) / 2 = 13.
Q3: i = (75/100)10 = 7.5. Since we have a decimal we round up to 8, so the 8th data point is the 75th percentile: Q3 = 21.

5-number summary: minimum = 2, Q1 = 8, Q2 = 13, Q3 = 21, maximum = 32.

IQR = Q3 - Q1 = 21 - 8 = 13

Outliers lie more than 1.5 * IQR below Q1 or above Q3, so the upper and lower bounds for outliers are:

Upper bound = Q3 + 1.5 * IQR = 21 + 1.5(13) = 40.5
Lower bound = Q1 - 1.5 * IQR = 8 - 1.5(13) = -11.5

We check whether any of the values are lower than -11.5 or higher than 40.5. Since no values fall outside these bounds, we conclude there are NO outliers.

Graph of the 5-number summary: note that Q1 (the bottom of the box) is 8, Q2 (the line that goes through the box) is 13, and Q3 (the top of the box) is 21.
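The arithmetic in parts (a) and (b) can be checked with a short script. The following is a minimal Python sketch and not part of the original solution. The helper name percentile_by_index is hypothetical. It implements the round-up index rule i = (p/100)n described above, which is only one of several quartile conventions.

```python
import math

# Data set from Problem 1.
data = [2, 7, 18, 11, 8, 9, 15, 21, 25, 32]


def percentile_by_index(sorted_values, p):
    """Percentile using the index rule i = (p/100) * n.

    Hypothetical helper; assumes 0 < p < 100 and sorted input.
    """
    n = len(sorted_values)
    i = p * n / 100
    if i == int(i):
        # Whole-number index: average the i-th and (i+1)-th values (1-based).
        i = int(i)
        return (sorted_values[i - 1] + sorted_values[i]) / 2
    # Fractional index: round up and take the value at that 1-based position.
    return sorted_values[math.ceil(i) - 1]


ordered = sorted(data)
mean = sum(data) / len(data)
q1, q2, q3 = (percentile_by_index(ordered, p) for p in (25, 50, 75))
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print("mean:", mean)
print("five-number summary:", ordered[0], q1, q2, q3, ordered[-1])
print("IQR:", iqr, "fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Library routines such as Python's statistics.quantiles interpolate between data points by default, so their quartiles can differ slightly from the values this index rule produces.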
[Figure: Boxplot of Data Set. Vertical axis: Range of Values for Data, 0 to 35.]

(c) Variance: the sum of squared deviations is divided by n - 1 = 9, so this is the sample variance.

s² = ∑(xi - x̄)² / (n - 1) = [ (2 - 14.8)² + (7 - 14.8)² + … + (25 - 14.8)² + (32 - 14.8)² ] / 9 = 767.6 / 9 = 85.29

Standard deviation: s = √85.29 = 9.24

(d) To make a stem and leaf display we use the ordered data and display it as follows:

0 | 2 7 8 9
1 | 1 5 8
2 | 1 5
3 | 2

2. Suppose that we are looking at a data set that we identify as a normal distribution. Given that the mean is 16 and the standard deviation is 3, calculate the following.
a. P(x < 17)
b. P(13 < x < 21)
c. What are the two points that will give you the middle 80% of the distribution?
In each of the above situations make sure to draw the appropriate normal distribution to accompany each question.

(a) To find the probability that we are less than 17 in the distribution, we need to convert the normal distribution to a standard normal distribution so we can use our standard normal table. To do this we convert our value to a z-score. Recall that z = (x - μ) / σ, so

z = (17 - 16) / 3 = 0.33

So we want P(z < 0.33) = 0.6293.

Graphically: we include all the area to the left of z = 0.33.

[Figure: standard normal curve, shaded to the left of z = 0.33.]

(b) We once again need to convert our data to z-scores in order to use our standard normal table:

z = (21 - 16) / 3 = 1.67 and z = (13 - 16) / 3 = -1

This corresponds to P(-1 < z < 1.67) = P(z < 1.67) - P(z < -1) = 0.9525 - 0.1587 = 0.7938.

Graphically: we don't want to include the areas in the tails.

[Figure: standard normal curve, shaded between z = -1.0 and z = 1.67.]

(c) Now we want to find the points that will give us the middle 80%. Graphically, the 80% falls between -z and +z for some z, and these z-values can be converted to x values.

[Figure: normal curve with the middle 80% split into regions A and B of 40% each, and a tail region labeled C.]

First we find the z-value to the right of the mean that has 90% of the area to its left (the middle 80% plus the 10% in the lower tail). We do this because our table gives all the area to the left of a given z-value. The tails are not excluded.
So we find the z-value whose area to the left includes A, B, and C (the tail of the graph). This value is z = 1.28 [note: 1.28 corresponds to a table value of 0.8997, which is close enough to 0.90 for our purposes]. Since z = 1.28 gives us our desired value and leaves 40% in B, we know by symmetry that the z-value to the left of the mean is z = -1.28.

To find the x values that correspond to these z-values we use the formula x = μ ± zσ:

Upper x = μ + zσ = 16 + 1.28(3) = 19.84
Lower x = μ - zσ = 16 - 1.28(3) = 12.16

So we have found the values that give us the middle 80% of the distribution. We expect 80% of all the data points to fall between these values.

[Figure: normal curve in x, with 12.16, 16, and 19.84 marked.]

3. Suppose that you are given the following data pairs of ACT score and GPA. **You may want to use Excel to check your calculations.

ACT   23    27    17    31    33    25    26    17    29    35
GPA   3.1   3.6   3.9   3.3   2.9   3.1   3.9   3.7   3.5   4.0

a. Calculate the mean, variance, and standard deviation for both GPA and ACT.
b. Calculate the correlation coefficient for the data.
c. What does your answer in part (b) imply about the relationship between ACT score and GPA? What does it tell you about simply looking at ACT score in order to predict GPA?

(a) GPA:
Mean: μ = ∑xi / N = (3.1 + 3.6 + … + 3.5 + 4.0) / 10 = 35 / 10 = 3.5
Variance: σ² = ∑(xi - μ)² / N = [ (3.1 - 3.5)² + (3.6 - 3.5)² + … + (3.5 - 3.5)² + (4.0 - 3.5)² ] / 10 = 0.134
Standard deviation: σGPA = √0.134 = 0.366

ACT:
Mean: μ = ∑xi / N = (23 + 27 + … + 29 + 35) / 10 = 263 / 10 = 26.3
Variance: σ² = ∑(xi - μ)² / N = [ (23 - 26.3)² + (27 - 26.3)² + … + (29 - 26.3)² + (35 - 26.3)² ] / 10 = 33.61
Standard deviation: σACT = √33.61 = 5.80

Note: the … simply means that we carry out the same operation on all the other data points in between. It is used only to save space.

(b) The general formula for correlation is ρxy = σxy / (σx σy).
Applied to this study, the formula becomes ρACT,GPA = σACT,GPA / (σGPA σACT).

We have both standard deviations, but we still need the covariance, which we calculate as follows:

Population covariance = σxy = (1/N) ∑(xi - μx)(yi - μy)

σACT,GPA = (1/N) ∑(GPAi - μGPA)(ACTi - μACT)
= (1/10) * [ (3.1 - 3.5)(23 - 26.3) + (3.6 - 3.5)(27 - 26.3) + … + (3.5 - 3.5)(29 - 26.3) + (4.0 - 3.5)(35 - 26.3) ]
= -4.4 / 10 = -0.44

Plugging all the numbers into the correlation coefficient formula we get:

ρACT,GPA = σACT,GPA / (σGPA σACT) = -0.44 / [(0.366)(5.80)] = -0.20727

(c) This implies a weak negative relationship between GPA and ACT score. Based on these results, although ACT score may help to predict GPA, it is not the only component that determines how well a student performs in the classroom. So we could not simply say that ACT score validly tells us how well a person will perform in school. On top of that, the relationship is negative, which implies that the higher the ACT score, the lower the GPA. This is the exact opposite of the result someone would expect.

Alternate answer to question 3

***Now we assume that we are dealing with sample data and not population statistics. In this case we have only a portion of the overall data and want to estimate what is going on overall. For example, we may have an entire graduating class with both GPA and ACT score, but look at only 10 data points instead of the entire class.
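The population calculation above and the sample calculation that follows can both be checked with a short script. This is a minimal Python sketch and not part of the original solution. The function name summarize and its sample flag are hypothetical, and the data pairs are assumed to line up in the order given in the problem statement.

```python
import math

# Data pairs from the problem statement, in the order given.
act = [23, 27, 17, 31, 33, 25, 26, 17, 29, 35]
gpa = [3.1, 3.6, 3.9, 3.3, 2.9, 3.1, 3.9, 3.7, 3.5, 4.0]


def summarize(xs, ys, sample=False):
    """Return (covariance, sd of xs, sd of ys, correlation).

    sample=False divides by N (population formulas);
    sample=True divides by n - 1 (sample formulas).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    divisor = n - 1 if sample else n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / divisor
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / divisor)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / divisor)
    return cov, sd_x, sd_y, cov / (sd_x * sd_y)


pop = summarize(gpa, act, sample=False)
samp = summarize(gpa, act, sample=True)

for label, (cov, sd_gpa, sd_act, corr) in (("population", pop), ("sample", samp)):
    print(f"{label}: cov={cov:.3f} sd_GPA={sd_gpa:.3f} sd_ACT={sd_act:.2f} r={corr:.4f}")
```

The covariance and standard deviations change with the divisor, but the divisor cancels in the ratio, so the two correlation values agree up to floating-point rounding. Any gap between hand-computed values comes from rounding the intermediate results.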
GPA:
Mean: x̄ = ∑xi / n = (3.1 + 3.6 + … + 3.5 + 4.0) / 10 = 35 / 10 = 3.5
Variance: s² = ∑(xi - x̄)² / (n - 1) = [ (3.1 - 3.5)² + (3.6 - 3.5)² + … + (3.5 - 3.5)² + (4.0 - 3.5)² ] / 9 = 0.149
Standard deviation: sGPA = √0.149 = 0.386

ACT:
Mean: x̄ = ∑xi / n = (23 + 27 + … + 29 + 35) / 10 = 263 / 10 = 26.3
Variance: s² = ∑(xi - x̄)² / (n - 1) = [ (23 - 26.3)² + (27 - 26.3)² + … + (29 - 26.3)² + (35 - 26.3)² ] / 9 = 37.3
Standard deviation: sACT = √37.3 = 6.11

Note: the … simply means that we carry out the same operation on all the other data points in between. It is used only to save space.

(b) The general formula for correlation is rxy = sxy / (sx sy). Applied to this study, the formula becomes rACT,GPA = sACT,GPA / (sGPA sACT).

We have both standard deviations, but we still need the covariance, which we calculate as follows:

Sample covariance = sxy = (1/(n - 1)) ∑(xi - x̄)(yi - ȳ)

sACT,GPA = (1/(n - 1)) ∑(GPAi - x̄GPA)(ACTi - x̄ACT)
= (1/9) * [ (3.1 - 3.5)(23 - 26.3) + (3.6 - 3.5)(27 - 26.3) + … + (3.5 - 3.5)(29 - 26.3) + (4.0 - 3.5)(35 - 26.3) ]
= -4.4 / 9 = -0.489

Plugging all the numbers into the correlation coefficient formula we get:

rACT,GPA = sACT,GPA / (sGPA sACT) = -0.489 / [(0.386)(6.11)] = -0.20734

We can see that the difference between the two answers is very small: we have to go out to the ten-thousandths place to find it. The gap comes only from rounding the intermediate values. The divisor (N or n - 1) appears in both the covariance and the product of the standard deviations, so it cancels, and the population and sample formulas give the same correlation coefficient. The variances, standard deviations, and covariance themselves do differ between the two approaches.

5. Suppose you were told that a regression line was fit to data on age and salary at OCCC, using all the people who worked in the Business Department at the college, and the results were:

Income = 20000 + 455 * Age, with r² = 0.45

a. Use this information to graph the regression line. Make sure to display the slope and intercept in your graph.
b. According to the equation, what is the predicted salary of someone with an age of 34?
c. What might be some reasons that this regression equation is not a good predictor of salary? (i.e., can you think of any shortcomings of using solely age to predict income?)
d. Given the square of the correlation, what type of relationship does age have with income, and how well does age do at predicting income?

(a) Given the equation, we can graph it just as if we were graphing a typical mathematical expression. Keep in mind, however, that this is not a typical mathematical expression: it comes from data and is an estimate of how age and income are related in a simple linear regression setting.

Table 1

Age      0        10       20       30       40       50
Income   20,000   24,550   29,100   33,650   38,200   42,750

[Graph 1: Income = 20000 + 455 * Age, plotted for ages 0 to 50. The intercept is 20,000 and the slope is 455.]

(b) If we want to know the predicted salary for someone aged 34, we simply plug the value into our least squares regression line:

Income = 20000 + 455 * Age = 20000 + 455 * 34 = $35,470

(c) There are a few reasons the equation might not be a good predictor of salary.
i. Age might not be the best variable with which to predict salary, so relating income to age might not be appropriate. Maybe education should be used.
ii. It makes no sense to talk about ages between 0 and 18. So although we can find the least squares regression line, it might not be appropriate to use it to predict salary for certain ages.
iii. Lastly, we have not yet assessed how close the relationship between the two variables is. If the relationship is not strong, age would not be a reliable predictor of income. See part (d).

(d) Recall that we are given r² = 0.45, so the correlation coefficient is r = √0.45 = +0.67. The sign is positive because the slope of the regression line is positive.
So we know that there is a positive correlation between the variables and that it is fairly strong. Since r² = 0.45, age accounts for about 45% of the variation in income. Age can therefore probably be used to predict income reasonably well, but the age range should be confined to the ages of people who actually work at OCCC, so we should not go below 18.
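The regression calculations in this problem can be reproduced with a few lines of code. This is a minimal Python sketch, assuming only the fitted equation and r² given in the problem. The names predict_income, INTERCEPT, SLOPE, and R_SQUARED are hypothetical labels introduced for illustration.

```python
import math

# Coefficients and r-squared as given in the problem statement.
INTERCEPT = 20000
SLOPE = 455
R_SQUARED = 0.45


def predict_income(age):
    """Predicted income from the fitted line Income = 20000 + 455 * Age."""
    return INTERCEPT + SLOPE * age


# Points used to draw the line (ages 0, 10, ..., 50), as in Table 1.
table = [(age, predict_income(age)) for age in range(0, 51, 10)]

# Part (b): predicted income at age 34.
prediction_34 = predict_income(34)

# Part (d): r is the positive square root because the slope is positive.
r = math.sqrt(R_SQUARED)

for age, income in table:
    print(f"age {age:2d}: income {income:,}")
print(f"predicted income at 34: ${prediction_34:,}")
print(f"r = {r:.2f}")
```

The function will return a value for any age, including ages below 18. As part (c) notes, a prediction outside the range of ages actually observed should not be trusted.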