Lecture 10 (MWF)
Data Analysis and Statistical Methods
Statistics 651
Suhasini Subba Rao
http://www.stat.tamu.edu/~suhasini/teaching.html

Review of previous lecture
• We calculated probabilities of a normal distribution by standardisation.
• Example: Suppose X ∼ N (−3, 0.5), what is P (X ≤ −3.5)?
  Standardise: P (X ≤ −3.5) = P ((X + 3)/√0.5 ≤ (−3.5 + 3)/√0.5) = P (Z ≤ −0.707) ≈ 0.239 (by using the normal tables). Note that when we compute (−3.5 + 3)/√0.5 we are going from the non-standard normal X ∼ N (−3, 0.5) to a standard normal, hence Z = (X + 3)/√0.5, where Z ∼ N (0, 1).
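A minimal R sketch of this calculation (reading N (−3, 0.5) as mean −3 and variance 0.5, so pnorm must be given the standard deviation √0.5; the call below is only an illustration of the table lookup):

  # P(X <= -3.5) for X ~ N(-3, 0.5); pnorm takes the standard deviation, not the variance
  pnorm(-3.5, mean = -3, sd = sqrt(0.5))   # approximately 0.24
  # equivalently, standardise first and use the standard normal
  pnorm((-3.5 - (-3)) / sqrt(0.5))         # P(Z <= -0.707)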
• We also did the reverse of this: finding the value on the x-axis where P (X ≤ x) = 0.8, when X ∼ N (6, 7) (for example).
  In this case we had to standardise: P ((X − 6)/√7 ≤ (x − 6)/√7) = 0.8. Look up in the tables the z-value that corresponds to 0.8; this is about 0.84. Therefore (x − 6)/√7 = 0.84, and solve for x.
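A minimal R sketch of this reverse lookup (again reading N (6, 7) as mean 6 and variance 7):

  # value x with P(X <= x) = 0.8 for X ~ N(6, 7)
  qnorm(0.8, mean = 6, sd = sqrt(7))   # about 8.23
  # or via the standard normal: x = 6 + sqrt(7) * z, where P(Z <= z) = 0.8
  6 + sqrt(7) * qnorm(0.8)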
• Up until now we have assumed that the random variable X is normally
distributed.
• Often, if we are not given any other information, we need to check whether this assumption is realistic.
• We check this assumption based on the data we have. We will do this in
this lecture.
• Note: In the case of the sample mean this assumption is close to valid,
thanks to the CLT, which comes later in the course.
Checking for Normality (a very rough check)
• Suppose x1, . . . , xn is a sample from a normal distribution with mean µ and variance σ².
• First we order them from the smallest number to the largest number: x(1), . . . , x(n).
• Estimate the mean and standard deviation from the data: x̄ and s.
• Plot all the observations on a number line. Locate the mean x̄ on this line and also the intervals [x̄ − s, x̄ + s], [x̄ − 2s, x̄ + 2s] and [x̄ − 3s, x̄ + 3s].
• If the observations came from a normal distribution, then
  – Roughly 68% of the observations should lie in the interval [x̄ − s, x̄ + s].
  – Roughly 95% of the observations should lie in the interval [x̄ − 2s, x̄ + 2s].
  – Roughly 99.7% of the observations should lie in the interval [x̄ − 3s, x̄ + 3s].
• Remember this means counting the number of points in each interval and dividing it by the total number of observations (see the sketch below).
• This is an extremely rough way to check for normality.
• There can exist weird non-normal distributions for which all of the following:
  – Roughly 68% of the observations lie in the interval [x̄ − s, x̄ + s].
  – Roughly 95% of the observations lie in the interval [x̄ − 2s, x̄ + 2s].
  – Roughly 99.7% of the observations lie in the interval [x̄ − 3s, x̄ + 3s].
  could still be true!
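A minimal R sketch of this counting check; the simulated data set x is an assumption purely for illustration:

  set.seed(1)
  x <- rnorm(100, mean = 5, sd = 2)   # illustrative data; replace with your own sample
  xbar <- mean(x)
  s <- sd(x)
  for (t in 1:3) {
    inside <- mean(x >= xbar - t * s & x <= xbar + t * s)   # proportion in [xbar - t*s, xbar + t*s]
    cat("t =", t, ":", round(100 * inside, 1), "% of observations (normal:",
        c(68, 95, 99.7)[t], "%)\n")
  }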
Motivating the QQplot
• We need to find a more accurate method (which is close in idea to the counting in an interval).
• This motivates the idea of the QQplot.
• Roughly speaking, the QQplot finds the percentage of observations which should lie in the interval [x̄ − t × s, x̄ + t × s] if they were to come from a normal distribution, for lots of different values of t (not just t = 1, 2, 3, as was done above). This is compared with the actual percentage of the observations which lie in the interval [x̄ − t × s, x̄ + t × s], and these two numbers are plotted against each other (sort of). If they are close then the resulting plot should be close to a straight line.
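A minimal R sketch of this idea, comparing the theoretical and observed proportions over a grid of t values (the simulated data and the grid are assumptions for illustration):

  x <- rnorm(100, mean = 5, sd = 2)                 # illustrative data
  xbar <- mean(x); s <- sd(x)
  t.grid <- seq(0.1, 3, by = 0.1)
  observed <- sapply(t.grid, function(t) mean(abs(x - xbar) <= t * s))
  theoretical <- pnorm(t.grid) - pnorm(-t.grid)     # P(|Z| <= t) under normality
  plot(theoretical, observed)                       # should lie close to a straight line
  abline(0, 1)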
Checking for normality: The QQ plot
• This is the QQplot: it plots what has been described above. It is not exactly this, but similar in idea. The details are given at the end of the lecture for those who are interested (but they are not necessary for the course).
• The QQplot consists of points and a straight 45 degree line.
[Figure: sketch of a QQplot; the ordered observations X(1), . . . , X(5) are plotted against the values y(1), . . . , y(5), together with the x = y line.]
• If the points tend to lie on the straight line, then this suggests the observations come from a normal distribution.
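In practice the plot is produced by software; a minimal R sketch (the simulated data x is an assumption for illustration):

  x <- rnorm(100, mean = 5, sd = 2)   # illustrative data
  qqnorm(x)                           # sample quantiles against theoretical normal quantiles
  qqline(x)                           # reference line through the quartiles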
Example: Antarctic maximum temperature QQplot
[Figure: Normal Q-Q plot; sample quantiles plotted against theoretical quantiles.]
What do you think about the assumption of normality?

Example: Antarctic minimum temperature QQplot
[Figure: Normal Q-Q plot; sample quantiles plotted against theoretical quantiles.]
What do you think about the assumption of normality? If it does not seem to come from a normal distribution, what can we say about the distribution it comes from?
Interpreting a QQ-plot
• Some experienced statisticians have shaman-like powers when it comes to interpreting QQ-plots.
• You don't need such powers, but it is good to have a feel for them.
• There are two main features you need to look for: skewness (left or right) and heavy tails.
  – Left skew. This means the distribution is not symmetric. Find the mode (the highest point of the distribution). The right side of the mode should be shorter than the left side.
  – Right skew. This means the distribution is not symmetric. Find the mode (the highest point of the distribution). The right side of the mode should be longer than the left side.
  – Heavy tails. This means that large values are much more likely than under a normal distribution. For example, for a normal distribution most of the observations (roughly 99.7%) lie within the interval [x̄ − 3s, x̄ + 3s]; for a heavy tailed distribution a far smaller proportion lies in this interval.
QQ-plot and skews
[Figure: QQ-plot of a right skewed sample against the 45 degree line.]
• The above indicates a right skewed distribution.
• The points are arched, going from below the 45 degree line, across it, and down again.
• For a left skewed distribution the QQ-plot is the mirror image along the 45 degree line (the arch goes upwards and towards the left).

A right skewed distribution
[Figure: density of a right skewed distribution (red) compared with a normal density (green).]
• A right skewed distribution (red) has a long right tail (green is the normal).
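A minimal R sketch of the arched pattern, using exponentially distributed data as an assumed example of a right skewed sample:

  set.seed(1)
  x.skew <- rexp(200, rate = 1)   # exponential data: strongly right skewed
  qqnorm(x.skew)                  # points arch relative to the reference line
  qqline(x.skew)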
QQ-plot and heavy tails
[Figure: QQ-plot of a heavy tailed sample against the 45 degree line.]
• The plot is like an 'S'. On the left of the plot the points lie to the left of the 45 degree line, and towards the right they move to the right of the 45 degree line.

Heavy tail distribution
[Figure: tails of a heavy tailed density (red) compared with the tails of a normal density (blue).]
• A heavy tailed distribution has much thicker tails than a normal distribution (the blue are the tails of a normal and the red are the tails of a heavy tailed distribution).
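A minimal R sketch of the 'S' shape, using t-distributed data as an assumed example of a heavy tailed sample:

  set.seed(1)
  x.heavy <- rt(200, df = 3)   # t distribution with 3 degrees of freedom: heavy tails
  qqnorm(x.heavy)              # both ends of the plot bend away from the reference line
  qqline(x.heavy)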
Transforming a distribution
• If the data is far from normal we often apply a transformation to make it 'more normal'.
• Standard transforms are (a short sketch of the first two follows this list):
  – The log transform: Xi → Yi = log Xi.
    This transformation should only be applied to positive observations. The variance of the transformed observations tends to be less than the variance of the original observations (sometimes this transformation is called 'variance stabilisation'). It is often used when the sample mean and sample variance of X are similar.
  – The square root transform: Xi → Yi = √Xi.
    This transformation tends to control outliers; very large values are pushed down (this is also true of the log transform).
  – There are many other transformations.
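A minimal R sketch of these two transforms, applied to the right skewed exponential sample used earlier (the data set is an assumption purely for illustration):

  x.skew <- rexp(200, rate = 1)   # positive, right skewed data
  y.log <- log(x.skew)            # log transform (positive observations only)
  y.sqrt <- sqrt(x.skew)          # square root transform: large values are pulled in
  par(mfrow = c(1, 2))
  qqnorm(y.log);  qqline(y.log)
  qqnorm(y.sqrt); qqline(y.sqrt)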
QQ plots and testing for normality
• There are 'statistical tests' (I have not defined this yet) for checking normality. One of the most famous is the Kolmogorov-Smirnov test.
• QQ plots for other distributions: it is possible to make a QQplot for other distributions, that is, to check whether the observations are drawn from another distribution of interest. The QQplot must be modified to the new distribution of interest. You could use the method detailed below to make this new plot, or do it using a statistical package.
• Again, the Kolmogorov-Smirnov test can be used to check whether the observations come from the distribution of interest.
• The rest of these lecture notes are an optional aside; they explain how a QQplot is made.
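A minimal R sketch of the Kolmogorov-Smirnov test for normality (here the mean and standard deviation are estimated from the same data, which is a common shortcut; strictly the test assumes they are specified in advance):

  x <- rnorm(100, mean = 5, sd = 2)                # illustrative data
  ks.test(x, "pnorm", mean = mean(x), sd = sd(x))  # compare x with a fitted normal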
Aside: Plotting a probability (Q-Q) plot
• Suppose we observe X1, . . . , Xn. We want to check if the observations X1, . . . , Xn are from a normal distribution with mean µ and variance σ².
• We order the observations X(1), . . . , X(n).
• If we draw the relative frequency histogram with the intervals only containing each point, the height of each point is 1/n.
[Figure: relative frequency histogram with a bar of height 1/n at each of the observations X1, . . . , X6.]
• Therefore, roughly speaking, P (X ≤ X(i)) ≈ i/n.
• To prevent the end point X(n) corresponding to 100 percent, that is P (X ≤ X(n)) = 1, which would make X(n) an absolute maximum (in another sample there could be a larger value), we remove a small amount from the probability. That is,
  P (X ≤ X(i)) ≈ i/n − 1/(2n).
  This means that P (X ≤ X(n)) = 1 − 1/(2n).
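A minimal R sketch of these plotting probabilities (n = 6 is an assumption, matching the worked example later in the notes):

  n <- 6
  i <- 1:n
  p <- i / n - 1 / (2 * n)   # P(X <= X_(i)) approx i/n - 1/(2n)
  p                          # 1/12, 3/12, 5/12, 7/12, 9/12, 11/12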
This is an example of how to make a probability plot.
• Let Z ∼ N (0, 1).
• Then, using the normal tables, we can evaluate the points yi where P (X ≤ yi) = i/n − 1/(2n). Basically, this is done by:
  – From the normal tables find the zi where P (Z ≤ zi) = i/n − 1/(2n) (where Z ∼ N (0, 1)).
  – Transform zi, such that yi = σzi + µ.
• Plot the points (X(i), yi).
[Figure: the ordered observations X(1), . . . , X(5) plotted against y(1), . . . , y(5), together with the x = y line.]
• The y-axis gives the 'true' values associated with the normal distribution, e.g. P (X ≤ yi) = i/n − 1/(2n).
• The x-axis is what we observe.
• We see how well the points fit through the x = y line.
• If the plot is close to linear, about the x = y line, then there is a large amount of evidence that the observations X1, . . . , Xn are from a normal distribution with mean µ and variance σ².
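A minimal R sketch of this construction; the data, and the use of the sample mean and standard deviation in place of µ and σ, are assumptions for illustration:

  set.seed(2)
  x <- rnorm(50, mean = 10, sd = 3)   # illustrative data
  n <- length(x)
  p <- (1:n) / n - 1 / (2 * n)        # P(X <= y_i) = i/n - 1/(2n)
  z <- qnorm(p)                       # z_i with P(Z <= z_i) = p_i
  y <- sd(x) * z + mean(x)            # y_i = sigma * z_i + mu (estimated here)
  plot(sort(x), y, xlab = "observed X_(i)", ylab = "theoretical y_i")
  abline(0, 1)                        # the x = y line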
Aside: example of making a QQplot
We observe 117, 132, 111, 107, 85, 89. We would like to know if it is a sample from a normal distribution with mean 106 and variance 258 (this is the sample mean and variance of the observations).
The idea is:
• Order the observations: 85, 89, 107, 111, 117, 132.
• We see that, using the plotting probabilities i/n − 1/(2n), approximately:
  8.3% of these observations are less than or equal to 85.
  25% of these observations are less than or equal to 89.
  41.6% of these observations are less than or equal to 107.
  58.3% of these observations are less than or equal to 111.
  75% of these observations are less than or equal to 117.
  91.6% of these observations are less than or equal to 132.
• What we want to do is to see whether, for a normal distribution with mean 106 and variance 258, approximately:
  8.3% of the curve is less than 85.
  25% of the curve is less than 89.
  41.6% of the curve is less than 107.
  58.3% of the curve is less than 111.
  75% of the curve is less than 117.
  91.6% of the curve is less than 132.
• We do this by finding the points on the normal curve where
  8.3% of the curve is less than y1.
  25% of the curve is less than y2.
  41.6% of the curve is less than y3.
  58.3% of the curve is less than y4.
  75% of the curve is less than y5.
  91.6% of the curve is less than y6.
• If 85 is close to y1, 89 is close to y2, 107 is close to y3, 111 is close to y4, 117 is close to y5 and 132 is close to y6, then we can say the observations may have come from a normal with mean 106 and variance 258.
• We do this by plotting a graph and seeing whether the points lie close to the 45 degree line.
• We now show how to evaluate yi.
  observation   order i   probability pi = i/6 − 1/12   zi on the standard normal curve, P (Z ≤ zi) = pi   transformed zi: yi = √258 × zi + 106
  85            1         1/6 − 1/12 = 0.083            −1.38                                              84
  89            2         2/6 − 1/12 = 0.25             −0.67                                              95
  107           3         3/6 − 1/12 = 0.416            −0.21                                              103
  111           4         4/6 − 1/12 = 0.583            0.21                                               109
  117           5         5/6 − 1/12 = 0.75             0.67                                               117
  132           6         6/6 − 1/12 = 0.916            1.38                                               128

We obtain zi from the normal tables, but the opposite way round from how we find probabilities: we look for the probability i/n − 1/(2n) inside the table and locate the corresponding zi.

Basically, if Z ∼ N (0, 1), then the numbers above mean:
• P (Z ≤ −1.38) ≈ 0.083,
• P (Z ≤ −0.67) ≈ 0.25,
• P (Z ≤ −0.21) ≈ 0.416,
• P (Z ≤ 0.21) ≈ 0.583,
• P (Z ≤ 0.67) ≈ 0.75 and
• P (Z ≤ 1.38) ≈ 0.916.
Plot yi against X(i).

  X(i):  85   89   107  111  117  132
  yi:    84   95   103  109  117  128

[Figure: plot of yi against X(i), with both axes running from 70 to 140.]
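A minimal R sketch reproducing this worked example; the yi values above can be checked with qnorm (small differences are just rounding):

  obs <- c(117, 132, 111, 107, 85, 89)
  x.sorted <- sort(obs)             # 85 89 107 111 117 132
  n <- length(obs)
  p <- (1:n) / n - 1 / (2 * n)      # 1/12, 3/12, 5/12, 7/12, 9/12, 11/12
  z <- qnorm(p)                     # standard normal z_i
  y <- sqrt(258) * z + 106          # y_i = sqrt(258) * z_i + 106
  round(cbind(X = x.sorted, p = p, z = z, y = y), 2)
  plot(x.sorted, y, xlab = "X_(i)", ylab = "y_i", xlim = c(70, 140), ylim = c(70, 140))
  abline(0, 1)                      # points close to this line suggest normality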