Ms. Amal Jamil EL-Sayed

Modelling variations (D2)

This chapter is about modelling the variation observed in data, and it is concerned primarily with one particular model for variation called the normal distribution. There are two main learning file themes for this chapter. The first involves you in using the software package relevant to this block. This is the data analysis software for MST 121, which from now on we shall call OUstats.

Consider how we might model the variation in men's heights. The first step is to obtain some data. Some searching turned up an old data set which contains the heights of 1000 Cambridge men in 1902. These heights are a sample of the heights of all Cambridge men in 1902, and will be used to illustrate how the variation observed in men's heights may be modelled (page 9). In the article the Cambridge men were taken to be representative of the general population. The following figure shows a frequency diagram for the heights of the 1000 Cambridge men in 1902.

Figure 1.2 The heights of 1000 Cambridge men.

Activity 1.1: Properties of the data.
Describe the shape of the frequency diagram in Figure 1.2. Use the diagram to estimate very roughly the mean height of the men. Over what range are the heights of the men spread?

Answer:
1. The frequencies are low on the left side of the diagram, rise steadily to reach a maximum for heights of 69 inches, then decrease towards the right.
2. The frequency diagram is roughly symmetrical about a single peak or mode; it is unimodal.
3. The diagram could be described as approximately "bell-shaped". (So the model can be represented by a curve.)
4. Since the diagram is roughly symmetrical, the mean height is approximately 69 inches (the central height).
5. The heights of the men range from approximately 61.5 inches to approximately 77.5 inches.

Remark: Areas under the curve are used to represent proportions or probabilities.
The normal distribution:
1) A bell-shaped curve would seem to be a good model for the variation in the data. A model which has this shape is the normal distribution.
2) The equation of a typical normal curve is y = f(x), where
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right), \quad -\infty < x < \infty.$$
3) The function f is known as the probability density function of the distribution.
4) The normal distribution is a continuous model.
5) μ gives the location of the centre of the curve. σ governs the spread of the values that are most likely to occur (σ is the standard deviation).
6) The normal distribution with mean equal to zero and standard deviation equal to one is called the standard normal distribution.

Some properties of the normal curve:
1. Normal curves are bell-shaped and are symmetrical with respect to a vertical line.
2. The mean is at the centre.
3. Irrespective of the shape, the area enclosed by the curve and the x-axis is always equal to 1.
4. The probability that an outcome of a normally distributed experiment lies between a and b equals the area under the associated normal curve from x = a to x = b.
5. The standard deviation of a normal distribution plays a major role in describing the area under the normal curve.
6. The curve never meets the x-axis.

The area between the normal curve and the x-axis from x = −N to x = N is given by
$$\text{area} = \int_{-N}^{N} f(x)\,dx,$$
where f is the probability density function given in 2). The total area under the curve is
$$\int_{-\infty}^{\infty} f(x)\,dx = \lim_{N\to\infty} \int_{-N}^{N} f(x)\,dx = 1.$$

Activity 1.4 Interpreting the model
Figure 1.13 contains three sketches of the particular normal curve used to model the heights of all Cambridge men in 1902. For each sketch, describe in words what the shaded area represents.

Figure 1.13 Three areas under a normal curve used to model the heights of Cambridge men in 1902.

Solution:
a) The shaded area represents the proportion of all Cambridge men in 1902 who were between 69 and 71 inches tall.
Alternatively, it represents the probability that a man selected at random from all Cambridge men would have been between 69 and 71 inches tall.
b) The shaded area represents the proportion of all Cambridge men in 1902 who were under 65 inches tall. Alternatively, it represents the probability that a man selected at random from all Cambridge men would have been less than 65 inches tall.
c) The shaded area represents the proportion of all Cambridge men in 1902 who were over six feet tall. Alternatively, it represents the probability that a man selected at random from all Cambridge men would have been over six feet tall.

Section 2: Choosing a normal model.
1. Probability distributions are used to model the variation in a population.
2. The population parameters, namely the population mean and the population standard deviation, are used to specify the normal model.

The sample mean.
Suppose that a sample of n observations $x_1, x_2, \ldots, x_n$ is taken from a population. The sample mean is
$$\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

Example: Find the mean of the heights of the 1000 Cambridge men.

Heights of 1000 Cambridge men.
Height in inches    Frequency
62                  3
63                  20
64                  24
65                  40
66                  85
67                  122
68                  139
69                  179
70                  139
71                  107
72                  55
73                  47
74                  22
75                  12
76                  5
77                  1
Total               1000

When the value $x_i$ occurs with frequency $f_i$ (i = 1, ..., k) and $n = f_1 + \cdots + f_k$,
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} x_i f_i = \frac{1}{n}(x_1 f_1 + \cdots + x_k f_k),$$
so
$$\bar{x} = \frac{1}{1000}(62 \times 3 + 63 \times 20 + \cdots + 77 \times 1) = 68.872.$$
So the sample mean is approximately 68.9 inches.

Activity 2.1 Irish dipper nestlings

Table 2.2 Weights of Irish dipper nestlings
Weight in grams     Frequency
9-11                1
11-13               8
13-15               3
15-17               6
17-19               5
19-21               12
21-23               13
23-25               18
25-27               27
27-29               18
29-31               22
31-33               20
33-35               3
35-37               17
37-39               8
39-41               7
41-43               4
43-45               4
45-47               2
Total               198

What values would you use to calculate the mean weight of these nestlings?

Answer: We use the midpoints of the intervals.
Weight in grams     Frequency     Midpoint
9-11                1             10
11-13               8             12
13-15               3             14
15-17               6             16
17-19               5             18
19-21               12            20
21-23               13            22
23-25               18            24
25-27               27            26
27-29               18            28
29-31               22            30
31-33               20            32
33-35               3             34
35-37               17            36
37-39               8             38
39-41               7             40
41-43               4             42
43-45               4             44
45-47               2             46
Total               198

So the sample mean is
$$\bar{x} = \frac{1}{198}(10 \times 1 + 12 \times 8 + \cdots + 46 \times 2) = \frac{5532}{198} \approx 27.94 \text{ grams}.$$

The population mean.
The formula for the mean of a continuous distribution, which corresponds to formula (2.1) (page 25) for the mean of a discrete distribution, is
$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx,$$
where f is the probability density function of the distribution.

The standard deviation.
$$\text{Population variance} = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx.$$
The population standard deviation is defined to be the square root of the population variance.

The sample standard deviation.
Consider the sample $x_1, x_2, \ldots, x_n$ of size n.
$$\delta^2 = \text{sample variance} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,$$
$$\delta = \text{sample standard deviation} = \sqrt{\text{variance}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

Sample statistic        Population parameter
$\bar{x}$ is used to estimate   μ
δ is used to estimate           σ

Exercise: The numbers of items of mail delivered on a Monday morning to each of 5 houses chosen at random from those in a large estate were as follows:
2 7 3 1 2
a) Find the sample mean and the sample standard deviation.
b) If there are 3000 houses on the estate, what is your estimate of the total number of items of mail delivered to the estate on that morning?

Answer:
a) The sample mean is
$$\bar{x} = \frac{2 + 7 + 3 + 1 + 2}{5} = \frac{15}{5} = 3.$$

$x$       $x - \bar{x}$       $(x - \bar{x})^2$
2         2 − 3 = −1          1
7         7 − 3 = 4           16
3         3 − 3 = 0           0
1         1 − 3 = −2          4
2         2 − 3 = −1          1
Total 15                      22

n = 5, so n − 1 = 4.
The sample standard deviation is $\delta = \sqrt{22/4} \approx 2.3$ (to 1 d.p.).
b) The sample mean $\bar{x} = 3$ may be used to estimate the mean number of items of mail delivered to each house on the estate. So an estimate of the total number of items of mail delivered to the estate on that morning is 3 × 3000 = 9000.

Example: The gross weekly earnings (in pounds) in 1995 of a sample of six mechanical engineers were as follows:
310 635 464 520 381 732
Find the sample mean and the sample standard deviation.
Answer:
The sample mean is
$$\bar{x} = \frac{310 + 635 + 464 + 520 + 381 + 732}{6} = \frac{3042}{6} = 507.$$

$x$         $x - \bar{x}$     $(x - \bar{x})^2$
310         −197              38809
635         128               16384
464         −43               1849
520         13                169
381         −126              15876
732         225               50625
Total 3042                    123712

n = 6, so n − 1 = 5.
The sample standard deviation is $\delta = \sqrt{123712/5} \approx 157.3$ (to 1 d.p.).
So the sample mean of the gross weekly earnings of the 6 engineers was £507 and the sample standard deviation of the gross weekly earnings was approximately £157.30.

Section 6
Suppose that a normal distribution is used to model the variation in a population. Then according to the model:
o approximately 68.3% of the population are within 1 standard deviation of the mean (that is, between μ − σ and μ + σ);
o approximately 95.4% of the population are within 2 standard deviations of the mean (that is, between μ − 2σ and μ + 2σ);
o almost all the population, about 99.7%, are within 3 standard deviations of the mean (that is, between μ − 3σ and μ + 3σ).
Note that these results hold true whatever the values of the mean μ and the standard deviation σ of the distribution. The results are illustrated in Figure 6.5.

Figure 6.5 Areas under normal curves

Example 6.1
The normal distribution used to model the heights of Cambridge men in 1902 has mean μ = 68.9 and standard deviation σ = 2.57. According to this model, within what range were the heights of almost all Cambridge men (that is, about 99.7% of them)? How does this compare with the sample of heights?

Solution:
Almost all Cambridge men (about 99.7%) were between μ − 3σ and μ + 3σ, that is, between (68.9 − 3 × 2.57) and (68.9 + 3 × 2.57) inches tall, that is, between 61.2 and 76.6 inches. So the model predicts that about three in a thousand men will be either shorter than 61.2 inches or taller than 76.6 inches.
From the table (page 23), only one man's height was outside the range 61.2 inches to 76.6 inches; no man was less than 61.5 inches tall. So almost all the men in the sample were within 3 standard deviations of the mean height. The normal model reflects quite well the proportion of men in the sample who were unusually short or tall.
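The arithmetic in the earnings example and in Example 6.1 can be checked with a short script. The following is a minimal sketch using only Python's standard library (it is not OUstats, and all variable and function names are illustrative assumptions). It treats the tabulated heights as whole-inch values, and it uses the standard result that the proportion of a normal population within k standard deviations of the mean is erf(k/√2).

```python
import math
import statistics

# Gross weekly earnings (pounds) of the six mechanical engineers.
earnings = [310, 635, 464, 520, 381, 732]
mean_earnings = statistics.mean(earnings)   # sample mean
sd_earnings = statistics.stdev(earnings)    # sample sd, divisor n - 1

# Heights (inches) and frequencies of the 1000 Cambridge men.
heights = list(range(62, 78))
freqs = [3, 20, 24, 40, 85, 122, 139, 179, 139, 107, 55, 47, 22, 12, 5, 1]
mean_height = sum(h * f for h, f in zip(heights, freqs)) / sum(freqs)

# Example 6.1: the range within 3 standard deviations of the mean.
mu, sigma = 68.9, 2.57
lower, upper = mu - 3 * sigma, mu + 3 * sigma

# Number of men in the sample whose recorded height is outside that range.
outside = sum(f for h, f in zip(heights, freqs) if not lower <= h <= upper)


def proportion_within(k):
    """Proportion of a normal population within k sd of the mean."""
    return math.erf(k / math.sqrt(2))


print(mean_earnings, round(sd_earnings, 1))
print(mean_height, round(lower, 1), round(upper, 1), outside)
print([round(proportion_within(k), 3) for k in (1, 2, 3)])
```

Run as written, the script should give a mean of 507 and a standard deviation of about 157.3 for the earnings, a mean height of 68.872 inches, a three-standard-deviation range of about 61.2 to 76.6 inches with one man outside it, and proportions of about 0.683, 0.954 and 0.997.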
Do Activity 6.1, page 37.

Suppose that a normal distribution is used to model the variation in a population. Then according to the model:
o approximately 90% of the population are within 1.64 standard deviations of the mean (that is, between μ − 1.64σ and μ + 1.64σ);
o approximately 95% of the population are within 1.96 standard deviations of the mean (that is, between μ − 1.96σ and μ + 1.96σ);
o almost all the population, about 99%, are within 2.58 standard deviations of the mean (that is, between μ − 2.58σ and μ + 2.58σ).
Note that these results hold true whatever the values of the mean μ and the standard deviation σ. The results are illustrated in Figure 6.6.

Figure 6.6 Areas under normal curves

Example 6.2
According to the normal distribution used to model the variation in the heights of Cambridge men in 1902, within what ranges were the heights of approximately 90%, 95% and 99% of Cambridge men?

Solution:
μ = 68.9, σ = 2.57
Approximately 90% of Cambridge men were between 68.9 − (1.64)(2.57) ≈ 64.7 and 68.9 + (1.64)(2.57) ≈ 73.1 inches tall.
Approximately 95% of Cambridge men were between 68.9 − (1.96)(2.57) ≈ 63.9 and 68.9 + (1.96)(2.57) ≈ 73.9 inches tall.
Approximately 99% of Cambridge men were between 68.9 − (2.58)(2.57) ≈ 62.3 and 68.9 + (2.58)(2.57) ≈ 75.5 inches tall.

Remark
If X is a continuous random variable, then a function y = f(x) is called a (probability) density function for X if and only if it has the following properties:
1. $f(x) \ge 0$ for all x.
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
3. $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$.

More practice:
1) Show that the function
f(x) = 1, if 0 ≤ x ≤ 1
f(x) = 0, otherwise
is a density function for X.

Solution:
We must verify that f(x) satisfies the three conditions in the definition of the density function.
First, f(x) is either 0 or 1, so f(x) ≥ 0.
Next,
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{1} 1\,dx = \big[x\big]_{0}^{1} = 1.$$
Finally, to verify that $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$, we compute the area under the graph between x = a and x = b. For 0 ≤ a ≤ b ≤ 1 we have
$$\int_{a}^{b} f(x)\,dx = \int_{a}^{b} 1\,dx = \big[x\big]_{a}^{b} = b - a,$$
which is the probability P(a ≤ X ≤ b), as stated in the above remark.

2) Consider the exponential density function $f(x) = e^{-x}$ for x ≥ 0, and f(x) = 0 for x < 0.
a. Find P(2 < X < 3).
b. Find P(X > 4).

Solution:
a.
$$P(2 < X < 3) = \int_{2}^{3} e^{-x}\,dx = \big[-e^{-x}\big]_{2}^{3} = -e^{-3} - (-e^{-2}) = e^{-2} - e^{-3} \approx 0.086.$$
b.
$$P(X > 4) = \int_{4}^{\infty} e^{-x}\,dx = \lim_{r\to\infty} \int_{4}^{r} e^{-x}\,dx = \lim_{r\to\infty} \big[-e^{-x}\big]_{4}^{r} = \lim_{r\to\infty} \left(-e^{-r} + e^{-4}\right) = \lim_{r\to\infty} \left(-\frac{1}{e^{r}}\right) + e^{-4} = 0 + e^{-4} \approx 0.018.$$
Another solution for part b:
$$P(X > 4) = 1 - P(X \le 4) = 1 - \int_{0}^{4} e^{-x}\,dx \approx 0.018.$$

3) Consider the function $f(x) = 1 - x^2$ for x in [−1, 1].
a) Show that f(x) is non-negative in [−1, 1]. Find the area A under the function f(x) in the interval [−1, 1].
b) Consider the function $g(x) = \frac{3}{4} f(x)$ in [−1, 1]. Show that g(x) is a probability density function in [−1, 1].
c) Consider the continuous random variable X, with probability density function g(x). Find P(X = 0), P(X = 1/2), P(X in [0, 1/2]) and P(X in [0, 1]).
d) Show that the mean μ of the continuous random variable X, with probability density function g(x), is equal to zero.

Solution:
a) f(−1) = f(1) = 0. For x in (−1, 1), $x^2 < 1$, hence $x^2 - 1 < 0$ and so $1 - x^2 > 0$. Therefore f is non-negative in [−1, 1].
The area under the curve is
$$A = \int_{-1}^{1} (1 - x^2)\,dx = \left[x - \frac{x^3}{3}\right]_{-1}^{1} = \frac{4}{3}.$$
b) Since f is non-negative in [−1, 1] and 0 ≤ f(x) ≤ 1, g(x) is non-negative in [−1, 1] and 0 ≤ g(x) ≤ 3/4. Furthermore, the area under the function g is
$$\text{Area} = \int_{-1}^{1} \frac{3}{4} f(x)\,dx = \frac{3}{4} \int_{-1}^{1} f(x)\,dx = \frac{3}{4} \cdot \frac{4}{3} = 1.$$
c) P(X = 0) = 0.
P(X = 1/2) = 0.
P(X in [0, 1/2]) is the area under g(x) in [0, 1/2]:
$$P\left(0 \le X \le \tfrac{1}{2}\right) = \frac{3}{4} \int_{0}^{1/2} f(x)\,dx = \frac{3}{4}\left(\frac{1}{2} - \frac{1}{24}\right) = \frac{3}{4} \cdot \frac{11}{24} = \frac{11}{32}.$$
$$P(0 \le X \le 1) = \frac{3}{4} \int_{0}^{1} f(x)\,dx = \frac{3}{4} \int_{0}^{1} (1 - x^2)\,dx = \frac{3}{4}\left[x - \frac{x^3}{3}\right]_{0}^{1} = \frac{3}{4}\left(1 - \frac{1}{3}\right) = \frac{3}{4} \cdot \frac{2}{3} = \frac{1}{2}.$$
d)
$$\mu = \int_{-1}^{1} x\,g(x)\,dx = \frac{3}{4} \int_{-1}^{1} x(1 - x^2)\,dx = \frac{3}{4}\left[\frac{x^2}{2} - \frac{x^4}{4}\right]_{-1}^{1} = 0.$$
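The answers to practice problems 2 and 3 can also be checked numerically. The sketch below is an illustration only: it approximates each integral with the midpoint rule rather than integrating exactly, and the helper `integrate` and the other names are assumptions introduced for this example.

```python
import math


def integrate(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def exp_density(x):
    """Exponential density from problem 2: e^(-x) for x >= 0, else 0."""
    return math.exp(-x) if x >= 0 else 0.0


def g(x):
    """Density from problem 3: (3/4)(1 - x^2) on [-1, 1], else 0."""
    return 0.75 * (1 - x * x) if -1 <= x <= 1 else 0.0


p_2_3 = integrate(exp_density, 2, 3)            # P(2 < X < 3)
p_gt_4 = 1 - integrate(exp_density, 0, 4)       # P(X > 4) = 1 - P(X <= 4)
total_g = integrate(g, -1, 1)                   # total area under g
p_half = integrate(g, 0, 0.5)                   # P(X in [0, 1/2])
p_one = integrate(g, 0, 1)                      # P(X in [0, 1])
mean_g = integrate(lambda x: x * g(x), -1, 1)   # mean of X

print(round(p_2_3, 3), round(p_gt_4, 3))
print(round(total_g, 4), round(p_half, 5), round(p_one, 4), round(mean_g, 6))
```

The approximations should agree with the exact answers e⁻² − e⁻³ ≈ 0.086, e⁻⁴ ≈ 0.018, 1, 11/32, 1/2 and 0 to several decimal places. The probabilities P(X = 0) and P(X = 1/2) are not computed because the area over a single point is zero.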