Solutions to Sample Test #1

1 Descriptive Statistics

1.1 Small Sample

A very small sample is collected, with the following observations:

2.1   −1.2   1.1   0.3

Calculate
• the mean
• the median
• the standard deviation
• the range
• the maximum, minimum and the midrange

Solutions

A spreadsheet's solution is

Mean                             0.575
Median                           0.7
Population Standard Deviation    1.207
Population Variance              1.457
Sample Standard Deviation        1.394
Sample Variance                  1.943
Range                            3.3
Minimum                         −1.2
Maximum                          2.1
Midrange                         0.45

The numbers are rounded to three decimal digits.

1.2 Large Sample

A data set consists of the following (sorted, and rounded) 40 values:

−40.1  −14.1  −8.9  −1.3   0.3   1.7   2.4   3.7   3.8   3.8
  4.1    4.2   4.3   4.6   4.7   4.8   4.8   5     5.4   5.5
  5.6    5.7   5.8   5.8   5.9   6.1   6.1   6.2   6.3   6.3
  6.5    8.3   9.2   9.9  10.6  10.7  12.4  26.9  27.2  30.9

We find the following summaries:

Number of data points     40
Sum of values             210.64
Sum of their squares      5610.44

1. Compute the sample mean, and the sample standard deviation.
2. From the table, extract the following values:
   • Minimum, Maximum, Range
   • 1st Quartile, Median, 3rd Quartile

Solutions

A spreadsheet's solution is

Mean                             5.277
Median                           5.550
Population Variance            112.562
Population S.D.                 10.61
Sample Standard Deviation       10.745
Sample Variance                115.448
Range                           71.0
Minimum                        −40.1
Maximum                         30.9
1st Quartile                     4.025
3rd Quartile                     6.350

A Note on Quartiles

You will notice that the median, which can be chosen as any number between 5.5 and 5.6, is picked by the computer to be the midpoint between these two values. As for the first quartile, which could be any number between 3.8 and 4.1, and the third, which could be any number between 6.3 and 6.5, the program chose, respectively, 4.025 and 6.350. As you can see, the first is closer to the upper bound, and the second to the lower bound, i.e. they are pushed towards "the middle". As you can check for yourself on various web sites, this is only one of several "rules" in the literature for picking the "right" percentile. Of course, there is no "right" choice.
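As a sketch of how these competing quartile rules play out in practice, Python's standard statistics module implements two of them: its "inclusive" rule reproduces the spreadsheet's 4.025 and 6.350, while its default "exclusive" rule gives different quartiles from the very same data. (Keep in mind that the listed values are themselves rounded, so statistics recomputed from them may differ in the last digit from summaries based on the unrounded data.)

```python
import statistics

# The 40 sorted, rounded observations from the large-sample problem.
data = [-40.1, -14.1, -8.9, -1.3, 0.3, 1.7, 2.4, 3.7, 3.8, 3.8,
        4.1, 4.2, 4.3, 4.6, 4.7, 4.8, 4.8, 5.0, 5.4, 5.5,
        5.6, 5.7, 5.8, 5.8, 5.9, 6.1, 6.1, 6.2, 6.3, 6.3,
        6.5, 8.3, 9.2, 9.9, 10.6, 10.7, 12.4, 26.9, 27.2, 30.9]

print(statistics.mean(data))    # about 5.2775
print(statistics.median(data))  # 5.55, the midpoint of 5.5 and 5.6
print(statistics.stdev(data))   # sample standard deviation, about 10.745

# Two of the several quartile "rules" (the middle entry is the median):
print(statistics.quantiles(data, n=4, method='inclusive'))
# ~ [4.025, 5.55, 6.35] -- the spreadsheet's choice
print(statistics.quantiles(data, n=4))
# ~ [3.875, 5.55, 6.45] -- the default 'exclusive' rule
```

Both results are legitimate quartiles of the same sample; they differ only in how the rule interpolates between 3.8 and 4.1 (and between 6.3 and 6.5).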
Percentiles (quartiles being the special case of the 25th and 75th percentiles) are uniquely determined only when dealing with a continuum of data, as in probability distributions (e.g., the first quartile of the standard normal distribution is −0.67448975019608, since the probability of such a variable being less than that number is 0.25). These various rules are attempts to "interpolate" a continuum of data between the actual data points, and depending on which (arbitrary) rule you choose, you get different outcomes. In any case, notice how irrelevant the difference is in any practical use of the information.

2 Probability: Normal Distribution

2.1 Constructed Normality

"Grading on a curve" is a procedure by which student grades are changed to conform, approximately, to a normal distribution with a pre-assigned "true mean" (expectation, usually denoted by μ) and standard deviation (usually denoted by σ). Suppose an instructor takes as pre-assigned values μ = 2.2, σ = 0.8. With these choices, what is the probability for a student

1. to fail to get a passing grade of 2.0, i.e., calling the grade G, what is P[G ≤ 1.9]?
2. to score 4.0 (or higher: a normal distribution will allow, in theory, any real number as a grade), that is, P[G ≥ 4.0]?

Solutions

1. The z-score of 1.9, with our numbers for expectation and standard deviation, is (1.9 − 2.2)/0.8 = −0.375. This is the difference (negative, since 1.9 is less than the expected value) between our data point and the expected value, using the standard deviation as the unit. From tables, we can find, in the (by now) familiar way, that the probability of a standard normal variable being less than this is approximately 0.354.

2. Similarly, the z-score of 4 is (4 − 2.2)/0.8 = 2.25. The probability of exceeding this value is given by 1 minus the probability of a standard normal variable being less than 2.25, so that the final answer is approximately 0.012.

Note: In practice this would work as follows: the sample mean and standard deviation are used to compute empirical "z-scores" for the exam results, and the published grades would follow the theoretical distribution so that, for example, to get a 4.0 a student would have to have an empirical z-score of 2.25 or better. If the actual distribution of the exam turned out to be anything close to normal, more than 35% of the students would fail, and about 1% would get a 4.0 (regardless of how well or badly they did in absolute terms). Maybe this instructor should adjust the parameters, or, even better, just drop the idea of grading on a curve.

2.2 Quality Control

Normal models are generally not really good for "time to failure" issues that arise in quality control. Nonetheless, let's assume that a company has decided that the "lifetime" of a gadget can be described by a normal variable with parameters μ = 1.1, σ = 0.2 (time is measured in years). The company offers a 1-year warranty on its product. If the product fails before 1 year is over, the company will have to refund the buyer, and lose $100 on the transaction. On the other hand, the company is also betting on customers wanting to "upgrade" to a newer version within 2 years of their purchase.

1. What is the probability of a random gadget failing in the first year, and thus costing the company $100?
2. What is the probability that a random gadget will last longer than 2 years, and thus make the owner think twice about "upgrading"?

Note: If we call p the probability of the warranty kicking in, and c the associated cost (in our case c = $100), the product pc is called the expected cost. In a cost-benefit analysis, this quantity would be compared to the expected profit in order to determine whether the failure rate was acceptable, financially speaking, or not.

Solutions

Call T the time to failure of our gadget. We are assuming that T ∼ N(1.1, 0.04), and we ask

1. What is P[T < 1]?
To answer using tables, we "normalize" our variable, so the question becomes "what is P[(T − 1.1)/0.2 < (1 − 1.1)/0.2]?" The random variable on the left is a standard normal random variable; let's call it Z, so we are looking for P[Z < −0.5] ≈ 0.3085 (−0.5 is the z-score of 1). Thus, the average cost to the company would be about $30.85 per gadget (this last result was not asked in the text, but we might as well take note of it). Note that a 30% failure rate is substantial, but if the average profit for the company is higher than that expected cost, a cost-benefit analysis might conclude that it is in its best interest to produce such a shoddy product.

2. What is P[T > 2]?

Similarly, since the z-score of 2 is (2 − 1.1)/0.2 = 4.5, our answer is given by P[Z > 4.5] = 1 − P[Z ≤ 4.5] ≈ 0.0000034. 4.5 is such a large z-score that it is not reported in standard tables; an answer of "practically 0" would be perfect. Extreme z-scores are easily handled by a computer, or one can refer to less common tables reporting very small "tail probabilities", like the ones attached at the end of the file (a table of the normal distribution, together with a table of small tail probabilities put in the public domain and available on the web).
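All of the probabilities computed in this section can be checked without tables, using only the error function from Python's math module. This is a sketch; normal_cdf is a small helper defined here (it is not a library routine), using the identity Φ(x) = (1 + erf(x/√2))/2.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P[X <= x] for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Grading on a curve, mu = 2.2, sigma = 0.8:
print(normal_cdf(1.9, 2.2, 0.8))      # P[G <= 1.9], about 0.354
print(1 - normal_cdf(4.0, 2.2, 0.8))  # P[G >= 4.0], about 0.012

# Quality control, T ~ N(1.1, 0.2**2):
p_fail = normal_cdf(1.0, 1.1, 0.2)    # P[T < 1], about 0.3085
print(p_fail)
print(100 * p_fail)                   # expected warranty cost, about $30.85
print(1 - normal_cdf(2.0, 1.1, 0.2))  # P[T > 2], "practically 0"
```

The last value, 1 − Φ(4.5), comes out on the order of 3 × 10⁻⁶, confirming that such an extreme tail probability is simply off the end of a standard table.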