Normal Distribution

[Figure: standard normal density curve. The shaded area is the probability of z > 1.]

The normal distribution is actually a family of distributions, all with the same shape and parameterised by the mean, µ, and the standard deviation, σ. It is usually defined by a reference member of the family, which is then used to define the other members. This reference member has µ = 0 and σ = 1.

Definition: A random variable Z has a normal (or Gaussian) distribution with mean 0 and standard deviation 1 if and only if its distribution function Φ(z) (defined by p(Z ≤ z)) is given by

$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^{2}/2} \, dt$$

We write Z ~ N(0, 1) and say that Z has a standard normal distribution.

Definition: A random variable X has a normal (or Gaussian) distribution with mean µ and standard deviation σ if and only if

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

We write X ~ N(µ, σ²) and say that X has a normal distribution.

The normal distribution is symmetric about its mean µ. In particular, if Z ~ N(0, 1), then p(Z ≤ -z) = p(Z ≥ z), i.e. Φ(-z) + Φ(z) = 1 for all z.

Whatever the values of µ and σ, the area between µ - 2σ and µ + 2σ is always approximately 0.95 (95%). Similarly, whatever the values of µ and σ, the area between µ - σ and µ + σ is always approximately 0.68 (68%).

Example

It has been suggested that IQ scores follow a normal distribution with mean 100 and standard deviation 15. Find the probability that a person chosen at random will have
(a) an IQ less than 70
(b) an IQ greater than 110
(c) an IQ between 70 and 110.

In R, the function dnorm gives the density of the normal distribution. Generally more useful, though, is pnorm, which gives the cumulative distribution function. So in the IQ example, the probability of an IQ less than 70 is:

> pnorm(70,100,15)
[1] 0.02275013

Approximately 0.0228.

And the probability of an IQ less than 110 is:

> pnorm(110,100,15)
[1] 0.7475075

Thus, the probability of an IQ greater than 110 is 1 - 0.7475075:

> t=pnorm(110,100,15)
> 1-t
[1] 0.2524925

Approximately 0.2525.

Finally, for the probability of an IQ between 70 and 110, carry out a subtraction.
> pnorm(110,100,15) - pnorm(70,100,15)
[1] 0.7247573

Approximately 0.7248.

Alternatively,

$$p(70 \le X \le 110) = p\left(\frac{70 - 100}{15} \le \frac{X - 100}{15} \le \frac{110 - 100}{15}\right) = \Phi(0.6667) - \Phi(-2)$$

> pnorm(0.6667) - pnorm(-2)
[1] 0.724768

These are the converted variables on the standardised normal (z) scale. The answer is, of course, the same, apart from a small difference in the fifth decimal place caused by rounding 2/3 to 0.6667.

[Figure: standard normal curve marked at z = -2 and z = 0.6667.]

The Central Limit Theorem

Let X1, X2, ..., Xn be independent, identically distributed random variables with mean µ and variance σ².

Let S = X1 + X2 + ... + Xn

Then elementary probability theory tells us that E(S) = nµ and var(S) = nσ².

The Central Limit Theorem (CLT) further states that, provided n is not too small, S has an approximately normal distribution with the above mean nµ and variance nσ². In other words,

S approx ~ N(nµ, nσ²)

The approximation improves as n increases.

We will use R to demonstrate the CLT. Let X1, X2, ..., X6 come from the uniform distribution U(0, 1).

[Figure: density of the U(0, 1) distribution.]

For any uniform distribution on [A, B], the mean and variance are

$$\mu = \frac{A + B}{2}, \qquad \sigma^{2} = \frac{(B - A)^{2}}{12}$$

So for our distribution, µ = 1/2 and σ² = 1/12.

The Central Limit Theorem therefore states that S should have an approximately normal distribution with mean nµ (i.e. 6 x 0.5 = 3) and variance nσ² (i.e. 6 x 1/12 = 0.5). This gives standard deviation √0.5 = 0.7071. In other words,

S approx ~ N(3, 0.7071²)

Generate 10 000 results in each of six vectors for the uniform distribution on [0, 1] in R.

> x1=runif(10000)
> x2=runif(10000)
> x3=runif(10000)
> x4=runif(10000)
> x5=runif(10000)
> x6=runif(10000)

Let S = X1 + X2 + ... + X6

> s=x1+x2+x3+x4+x5+x6
> hist(s,nclass=20)

Consider the mean and standard deviation of S:

> mean(s)
[1] 3.002503
> sd(s)
[1] 0.7070773

This agrees with our earlier calculations.

One way to examine whether the distribution is approximately normal is to produce a normal Q-Q plot. This is a plot of the sorted values of the vector S (the "data") against what is in effect an idealised sample of the same size from the N(0, 1) distribution. If the CLT holds good, i.e.
if S is approximately normal, then the plot should show an approximate straight line with intercept equal to the mean of S (here 3) and slope equal to the standard deviation of S (here 0.707).

> qqnorm(s)

From these plots it seems that agreement with the normal distribution is very good, despite the fact that we have only taken n = 6, i.e. the convergence is very rapid!

Application: Confidence Intervals for the Mean

Suppose that the random variables Y1, Y2, ..., Yn model independent observations from a distribution with mean µ and variance σ². Then

$$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$$

is the sample mean. Now by the CLT,

$$\bar{Y} \;\text{approx} \sim N\left(\mu, \frac{\sigma^{2}}{n}\right)$$

This is because the sample mean is the sum S divided by n, so its mean is nµ/n = µ and its variance is nσ²/n² = σ²/n.

Recall from Statistics 2 that, if σ² is estimated by the sample variance, s², an approximate confidence interval for µ is given by:

$$\left(\bar{y} - z \frac{s}{\sqrt{n}}, \;\; \bar{y} + z \frac{s}{\sqrt{n}}\right)$$

Here ȳ is the observed sample mean, and z is the standard normal quantile determined by the level of confidence required. So for 95% confidence an approximate interval for µ is given by:

$$\left(\bar{y} - 2 \frac{s}{\sqrt{n}}, \;\; \bar{y} + 2 \frac{s}{\sqrt{n}}\right)$$

The value 2 is approximate. An accurate value can be obtained from tables or by using the qnorm function in R.

> qnorm(0.975)
[1] 1.959964
> qnorm(0.995)
[1] 2.575829
> qnorm(0.025)
[1] -1.959964

Thus in R, an approximate 95% confidence interval for the mean µ is given by

> mean(y)+c(-1,1)*qnorm(0.975)*sqrt(var(y)/length(y))

where y is the vector of observations.

A more accurate confidence interval, allowing for the fact that s² is only an estimate of σ², is given by use of the function t.test.
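
As a rough sketch of the last two steps, the R code below computes the normal-approximation interval and the interval returned by t.test on the same data. The vector y is simulated purely for illustration; the seed, the sample size of 50 and the choice of mean 100 and standard deviation 15 are assumptions made here, not data from these notes.

```r
# Hypothetical data: 50 simulated observations (for illustration only).
set.seed(1)
y <- rnorm(50, mean = 100, sd = 15)

# Approximate 95% interval from the CLT, using the normal quantile.
z_interval <- mean(y) + c(-1, 1) * qnorm(0.975) * sqrt(var(y) / length(y))

# 95% interval from t.test, which uses the t distribution
# with n - 1 degrees of freedom instead of the normal quantile.
t_interval <- t.test(y, conf.level = 0.95)$conf.int

print(z_interval)
print(t_interval)
```

Both intervals are centred on mean(y). The t.test interval is slightly wider, because the t quantile qt(0.975, n - 1) is larger than qnorm(0.975); the difference shrinks as n increases.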