Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
8. Advanced Graphs A. Q-Q Plots B. Contour Plots C. 3D Plots 273 Quantile-Quantile (q-q) Plots by David Scott Prerequisites • Chapter 1: Distributions • Chapter 1: Percentiles • Chapter 2: Histograms • Chapter 4: Introduction to Bivariate Data • Chapter 7: Introduction to Normal Distributions Learning Objectives 1. State what q-q plots are used for. 2. Describe the shape of a q-q plot when the distributional assumption is met. 3. Be able to create a normal q-q plot. Introduction The quantile-quantile or q-q plot is an exploratory graphical device used to check the validity of a distributional assumption for a data set. In general, the basic idea is to compute the theoretically expected value for each data point based on the distribution in question. If the data indeed follow the assumed distribution, then the points on the q-q plot will fall approximately on a straight line. Before delving into the details of q-q plots, we first describe two related graphical methods for assessing distributional assumptions: the histogram and the cumulative distribution function (CDF). As will be seen, q-q plots are more general than these alternatives. Assessing Distributional Assumptions As an example, consider data measured from a physical device such as the spinner depicted in Figure 1. The red arrow is spun around the center, and when the arrow stops spinning, the number between 0 and 1 is recorded. Can we determine if the spinner is fair? 274 lots 11/7/10 1:09 PM u=0.876 0 0.9 0.1 0.8 0.2 ● 0.7 0.3 Figure 1. A physical device that gives samples from a uniform distribution. If the spinner is fair, then these numbers should follow a uniform distribution. To 0.6 0.4 investigate whether the spinner is fair, spin the arrow n times, and record the 0.5 measurements by {μ 1, μ2, ..., μn}. In this example we collect n = 100 samples. The If the spinner is fair, then these numbers should follow a uniform distribution. To investigate whether the spinner is fair, spin the arrow n times, and record the measurements by {μ , μ , ..., μ }. In this example, we collect n = 100 samples. The histogram provides a useful visualization of these uniform data. In distribution. Figure 2, we display three different histograms on a probability scale. Theahistogram should be flat for a the spinner is fair, then these numbers should follow uniform distribuFigure 2. Three If histograms of a sample of 100 uniform points. uniform butn the visual perception varies depending tion. Spinsample, the arrow times, and record the measurements by {u1 ,on u2 , whether . . . , un }. the Collect n =has 10010, samples. provides a useful of histogram 5, or 3 The bins.histogram The last histogram looks visualization flat, but the other two these data. In Figure 2, we display three di↵erent histograms on a probhistograms are not obviously flat. It is not clear which histogram we should base ability scale. The histogram should be flat for a uniform sample, but the our conclusion on. visual perception varies whether the histogram has 10, 5, or 3 bins. The last histogram provides a useful visualization of these data. In Figure 2, we display three different histograms on a probability scale. The histogram should be flat for a uniform sample, but the visual perception varies depending on whether the histogram has 10, 5, 1 2 n or 3 bins. The last histogram looks flat, but the other two histograms are not obviously Figure 1: A physical device that gives samples from a flat. It is not clear which histogram we should base our conclusion on. histogram looks flat, but the other two histograms are not obviously flat. 1.4 1.2 Probability Alternatively, we might use the cumulative distribution function (CDF), which is denoted by F(μ).1.0 The CDF gives the probability that the spinner gives a value less than .8 /onlinestatbook.com/2/advanced_graphs/q-q_plots.html Page 2 of 13 .6 .4 .2 0 0 .2 .4 .6 .8 1 0 .2 .4 u .6 u .8 1 0 .2 .4 .6 .8 1 u Figure 2: Three histograms of a sample of 100 uniform points. Figure 2. Three histograms of a sample of 100 uniform points. 2 275 Alternatively, we might use the cumulative distribution function (CDF), which is denoted by F(μ). The CDF gives the probability that the spinner gives a value less than or equal to μ, that is, the probability that the red arrow lands in the interval [0, atively, we might use the cumulative distribution function (CDF), μ]. By simple arithmetic, F(μ) = μ, which is the diagonal straight line y = x. The denoted by CDF F (u). The CDF gives data the isprobability that the based upon the sample called the empirical CDFspinner (ECDF), is denoted alue less than or equal to u, that is, the probability that the red by ds in the interval [0, u]. By simple arithmetic, F (u) = u, which is Alternatively, we might use the cumulative distribution function (CDF), nal straightwhich line isydenoted = x. by The CDF based upon the sample data is F (u). The CDF gives the probability that the spinner gives a(ECDF), value less than or equal toby u, F̂ that is, the probability that to the be red empirical CDF is denoted and is defined n (u), arrow lands in the interval [0, u]. By simple arithmetic, F (u) = u, which is and isthan defined be theto fraction of the f the data less or toequal u; that is, data less than or equal to μ; that is, the diagonal straight line y = x. The CDF based upon the sample data is called the empirical CDF (ECDF), is denoted by F̂n (u), and is defined to be ui u to u; that is, fraction of the data less#than or equal F̂n (u) = . n # ui u . n staircase appearance. F̂n (u) = , the ECDF takes on a ragged In general, general, the onon a ragged appearance. In theECDF ECDF takes a2,ragged staircase appearance. e spinner sample analyzed intakes Figure we staircase computed the ECDF and For the spinner sample analyzed in Figure 2, we computed the ECDF and For spinner analyzed in Figure 2,ECDF we computed the ECDF and ch are displayed in the Figure 3.sample Inin the left frame, appears CDF, which are displayed Figure 3. In the left the frame, the ECDF appears which are in in Figure 3. In In the left frame, the ECDF appears close to close to the line = x, middle shown the middle frame. In the right frame,we we he line y =CDF, x, shown inydisplayed the frame. the right frame, the line y = verify x,two shown the middle frame. rightquite frame, these two overlay these curvesinand verify that they In arethe indeed closewe tooverlay each ese two curves and that they are indeed quite close to each other. Observe thatthat we they do not to specify the number bins as with that we do curves and verify areneed indeed quite close to eachofother. Observe bserve that the wehistogram. do not need to specify the number of bins as with not need to specify the number of bins as with the histogram. ram. 1 Empirical CDF .8 Probability Empirical CDF Theoretical CDF Overlaid CDFs Theoretical CDF .6 Overlaid CDFs .4 .2 0 0 .2 .4 .6 .8 1 0 .2 .4 u .2 .4 .6 .6 .8 1 0 .2 .4 u .6 .8 1 u Figure 3: The empirical and theoretical cumulative distribution functions of Figure 3. The empirical and theoretical cumulative distribution functions of a sample of 100 uniform points. a sample of 100 uniform points. .8 u 1 0 .2 .4 .6 .8 1 0 u 3 .2 .4 .6 .8 1 u q-q plot for uniform data The empirical and theoretical cumulative distribution functions of The q-q plot for uniform data is very similar to the empirical cdf graphic, of 100 uniform exceptpoints. with the axes reversed. The q-q plot provides a visual comparison of 276 3 q-q plot for uniform data The q-q plot for uniform data is very similar to the empirical CDF graphic, except with the axes reversed. The q-q plot provides a visual comparison of the sample quantiles to the corresponding theoretical quantiles. In general, if the points in a qq plot depart from a straight line, then the assumed distribution is called into question. Here we define the qth quantile of a batch of n numbers as a number ξq such that a fraction q x n of the sample is less than ξq, while a fraction (1 - q) x n of the sample is greater than ξq. The best known quantile is the median, ξ0.5, which is located in the middle of the sample. Consider a small sample of 5 numbers from the spinner µ1 = 0.41, µ2 = 0.24, µ3 = 0.59, µ4 = 0.03, µ5 =0.67. Q-Q plots Based upon our description of the spinner, we expect a uniform distribution to model these data. If the sample data were “perfect,” then on average there would be an observation in the middle of each of the 5 intervals: 0 to .2, .2 to .4, .4 to .6, and so on. Table 1 shows the 5 data points (sorted in ascending order) and the theoretically expected value of each based on the assumption that the distribution is uniform (the middle of the interval). 11/7/10 1:09 P on. Table 1 shows the data points (sorted in ascending order) and the theoretically Table 1. Computing the5Expected Quantile Values. expected value of each based on the assumption that the distribution is uniform (the middle of the interval). Data (µ) Rank (i) Middle of the ith Interval Table 1. Computing the Expected Quantile Values. 0.03 Data (μ) 0.24 .03 .24 0.41 .41 0.59 .59 .67 0.67 1 Middle of the ith 0.1 Rank ( i) Interval 1 2 3 4 5 .1 .3 .5 .7 .9 2 3 4 5 0.3 0.5 0.7 0.9 The theoretical and empirical CDF's are shown in Figure 4 and the q-q plot is shown in the left frame of Figure 5. 277 Figure 4: The theoretical and empirical CDFs of a small sample of 5 uniform points, together with the expected values of the 5 points The theoretical and empirical CDFs are shown in Figure 4 and the q-q plot is shown in1the left frame of Figure 5. 1 Theoretical CDF 1 .8 Probability .8 Empirical CDF ● 1 Theoretical CDF .8 .6 ● .8 .6 .6 .6 .4 .4 1 .8 Empirical ● CDF ● .6 ● ● .4 .4 .4 ● ● .2 .2 .2 .2 .2 ● 0 .2 .4 .6 .8 0 10 u ● .2 0 ● ● ● ● .4.2 .6 .4 .8 .6 1 u u 0 ● .8 0 ● 1.2 0 ● ● .40 ● u ● ● ● ● .6 .2 .8 .4 1 u 0 .6 0 ● ● ●● .8.2 ● .4 1 ● .6 ● .8 1 u Figure of of a small sample of 5 together uniform Figure4.4:The Thetheoretical empirical and CDFempirical of a smallCDFs sample 5 uniform points, The empirical CDF of together a small sample of5 5points uniform points, together points, with values of the points dots in with the expected values ofthe theexpected (red dots in 5 the right(red frame). thethe right expected values of 5 frame). points (red dots in the right frame). In general, we consider the full set of sample quantiles to be the sorted Indata general, valueswe consider the full set of sample quantiles to be the sorted data values neral, we consider the full set of sample quantiles to be the sorted u(1) < u(2) < u(3) < · · · < u(n 1) < u(n) , es µ(1) < µ(2) < µ(3) < ··· < µ(n-1) < µ(n) , the<parentheses indicate the data have been ordered. u(1)where < u(2) u(3) < · · ·in<the u(nsubscript 1) < u(n) , Roughly speaking, we expect the first ordered value to be in the middle of where the parentheses in subscript indicate data have been ordered. e parentheses in interval the subscript thetodata been ordered. the (0, 1/n),indicate thethesecond be inhave thethe interval (1/n, 2/n), and the Roughly speaking, we expect the first ordered value to be in the middle the to be the in the middle of thevalue interval Thus we speaking, welast expect first ordered to((n be in1)/n, the1).middle of take asofthe (0, 1/n), second to be in the middle of the interval (1/n, 2/n), and the theoretical quantile theinvalue val (0, 1/n),interval the second tothebe the interval (1/n, 2/n), and the last to be in the middle of the interval ((ni - 1)/n, 1). Thus, we take as the theoretical 0.5 we e in the middle of the interval ((n 1)/n, 1). Thus take as the ⇠q = q ⇡ , quantile the value n al quantile the value where q corresponds to the ith ordered sample value. We subtract the quani 0.5 tity 0.5 so⇠qthat = qwe⇡are exactly , in the middle of the interval ((i 1)/n, i/n). These ideas are depictedn in the right frame of Figure 4 for our small sample of size n = 5. orresponds to the ith ordered sample value. We subtract the quanWeq corresponds are now prepared define thesample q-q plot precisely. First, we where to thetoithof ordered value. We subtract the compute quantity 0.5 o that we are exactly in the middle the interval ((i 1)/n, i/n). the n expected values of the data, which we pair with the n data points so thatinwe areright exactly in the middle of the interval ((i - 1)/n, i/n). These ideas are as are depicted the frame 4 for our small sample sorted in ascending order. of ForFigure the uniform density, the q-q plot is composed depicted the right frame of Figure 4 for our small sample of size n = 5. of the ninordered pairs = 5. We are now prepared to define plot precisely. First, we compute the n ◆ the q-qFirst, ✓ q-q plot e now prepared to define the precisely. we compute i 0.5 expected values of the data, ,which we pair with the n. .data u , for i = 1, 2, . , n .points sorted in (i) pected values of the data, which n we pair with the n data points ascending order. For the uniform density, the q-q plot is composed of the n ordered ascending order. For the uniform density, the q-q plot is composed This definition is slightly di↵erent from the ECDF, which includes the points pairs ordered pairs(u(i) , i/n). In the left frame of Figure 5, we display the q-q plot of the 5 ◆ ✓ i 0.5 , u(i) , for i = 1, 2, . . .5, n . n 278 We are now prepared to define the q-q plot precisely. First, we compute he n expected values of the data, which we pair with the n data points orted in ascending order. For the uniform density, the q-q plot is composed of the n ordered pairs ◆ ✓ i 0.5 , u(i) , for i = 1, 2, . . . , n . n This definition is slightly di↵erent from the ECDF, which includes the points definition slightlyofdifferent which the points (u(i)5, u(i) , i/n). This In the left is frame Figurefrom 5, the we ECDF, display theincludes q-q plot of the i/n). In the left frame of Figure 5, we display the q-q plot of the 5 points in Table 1. Inpoints the right two frames of Figure 5,frames we display the q-q plot of the same batch of in Table 1. In the right two of Figure 5, we display the q-q plot 5 numbers usedbatch in Figure 2. In theused finalin frame, we2.add linewe y =add x as a of the same of numbers Figure In the thediagonal final frame point of reference. the diagonal line y = x as a point of reference. 1 ●●● ●● ●● ●●● .8 Sample quantile ●●● ●● ●● ●●● ●● ●●● ● ● ● ●●● ●● ●● ●● ●● ●● ● ●● ●●●● ● ●● ●●●● ●●● ● ● .6 ●● ●●● ● ● ● ●●● ●● ●● ●● ● ●● ●● ● ●● ●●●● ● ●● ●●●● ●●● ● ● ● ● .4 ● .2 0 ●● ●●●● ●●● ●●● ●● ●● ●● ●●● ● ●●● ●●● ●●●●● ● ● ●● ● ●● ● ● ● ● ●● ●●● ● 0 ● ●● ●●●● ●●● ●●● ●● ●● ●● ●●● ● ●●● ●●● ●●●●● ● ● .2 .4 .6 .8 Theoretical quantile 1 0 ● ●● ● ●● ● ● ● ● ●● ●●● .2 .4 .6 .8 Theoretical quantile 1 0 .2 .4 .6 .8 Theoretical quantile 1 Figure Figure5.5:(Left) (Left)q-q q-qplot plotofofthe the55uniform uniform points. points. (Right) (Right)q-q q-q plot plot of of aa sample sample of 100 uniform points. of 100 uniform points. The sample should be taken account when judging close The sample size size should be taken intointo account when judging howhow close thethe q-q plot is to the straight line.two Weother showuniform two other uniform samples of size isq-q to plot the straight line. We show samples of size n = 10 and n = n = 10 and n = 1000 in Figure 6. Observe that the q-q plot when n = 1000 1000 in Figure 6. Observe that the q-q plot when n = 1000 is almost identical to the is almost identical to the line y = x, while such is not the case when the line y = x, while such is not the case when the sample size is only n = 10. sample size is only n = 10. 1 ● ● Sample quantile .8 ● ● .6 ● .4 ● ● ● ● .2 0 .2 .4 .6 .8 Theoretical quantile ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 .2 .4 .6 .8 Theoretical quantile 1 Figure 6: q-q plot of a sample of 10 and 1000 uniform points. What if the data are in fact not uniform? In Figure 7 we show the 279 The sample size should be taken into account when judging how close the q-q plot is to the straight line. We show two other uniform samples of size n = 10 and n = 1000 in Figure 6. Observe that the q-q plot when n = 1000 is almost identical to the line y = x, while such is not the case when the sample size is only n = 10. 1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Sample quantile .8 ● ● ● .6 ● .4 ● ● ● ● .2 0 .2 .4 .6 .8 Theoretical quantile 0 .2 .4 .6 .8 Theoretical quantile 1 Figure 6. q-q plots of a sample of 10 and 1000 uniform points. Figure 6: q-q plot of a sample of 10 and 1000 uniform points. In Figure 7, we show the q-q plots of two random samples that are not uniform. In ifof the are samples in fact not In Figure we show the q-qWhat plotsexamples, two data random thatuniform? are notthe uniform. In 7both examples, both the sample quantiles match theoretical quantiles only at the the sample quantiles match the theoretical quantiles only at the median and median and at the extremes. Both samples seem to be symmetric around the 6 be symmetric around the median. But at the extremes. Both samples seem to median. But the data in the left frame are closer to the median than would be the data in the left frame are closer to the median than would be expected expected if theuniform. data were uniform. in frame the right arefrom further if the data were The data inThe thedata right areframe further thefrom the median than would expected if the datawere were uniform. median than would be be expected if the data uniform. 1 ● ● ● ● ● ● ● ● ● ● ● ● ● Sample quantile .8 .6 .4 .2 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 .2 .4 .6 .8 Theoretical quantile 1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 .2 .4 .6 .8 Theoretical quantile 1 Figure 7. q-q plots of two samples of size 1000 that are not uniform. Figure 7: q-q plots of two samples of size 1000 that are not uniform. In fact, the data were generated in the R language from beta distributions with In fact, the adata generated in the beta In distributions parameters = b were = 3 on the left and a =Rblanguage = 0.4 on from the right. Figure 8 we display with parameters a = b = 3 on the left and a = b = 0.4 on the right. In Figure histograms of these two data sets, which serve to clarify the true shapes of the 8 we display histograms of these two data sets, which serve to clarify the true densities. are clearly shapes of theThese densities. These non-uniform. are clearly non-uniform. 200 200 100 Frequency Frequency 150 150 100 280 In fact, the data were generated in the R language from beta distributions with parameters a = b = 3 on the left and a = b = 0.4 on the right. In Figure 8 we display histograms of these two data sets, which serve to clarify the true shapes of the densities. These are clearly non-uniform. 200 200 Frequency Frequency 150 100 50 150 100 50 0 0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Figure 8. Histograms of the two non-uniform data sets. Figure 8: Histograms of the two non-uniform data sets. q-q plot for normal data 7 The definition of the q-q plot may be extended to any continuous density. The q-q plot will be close to a straight line if the assumed density is correct. Because the cumulative distribution function of the uniform density was a straight line, the q-q plot was very easy to construct. For data that are not uniform, the theoretical quantiles must be computed in a different manner. Let {z1, z2, ..., zn} denote a random sample from a normal distribution with mean μ = 0 and standard deviation σ = 1. Let the ordered values be denoted by z(1) < z(2) < z(3) < ... < z(n-1) < z(n). These n ordered values will play the role of the sample quantiles. Let us consider a sample of 5 values from a distribution to see how they compare with what would be expected for a normal distribution. The 5 values in ascending order are shown in the first column of Table 2. 281 Table 2. Computing the Expected Quantile Values for Normal Data. of the Two Non-Uniform Data Sets. Data (z) Rank (i) Middle of the ith Interval -1.96 1 0.1 -1.28 -0.78 2 0.3 -0.52 0.31 3 0.5 0.00 1.15 4 0.7 0.52 1.62 5 0.9 1.28 z Just as in the case of the uniform distribution, we have 5 intervals. However, with a normal distribution the theoretical quantile is not the middle of the interval but rather the inverse of the normal distribution for the middle of the interval. Taking the first interval as an example, we want to know the z value such that 0.1 of the area in the normal distribution is below z. This can be computed using the Inverse Normal Calculator as shown in Figure 9. Simply set the “Shaded Area” field to the middle of the interval (0.1) and click on the “Below” button. The result is -1.28. Therefore, 10% of the distribution is below a z value of -1.28. 282 igure 9. Example of the inverse normal calculator for finding a value of the expected quantile from a normal distribution. Figure 9. Example of the Inverse Normal Calculator for finding a value of e q-q plot for the data Table 2 isquantile shown infrom the left frame distribution. of Figure 11. theinexpected a normal In general, what should we take as the corresponding theoretical quanitiles? Let the mulative distribution function thedata normal density beshown denoted the of Figure The q-q plot forofthe in Table 2 is in by theΦ(z). left In frame 11. In general, what should we take as the corresponding theoretical quantiles? is the qth quantile of a normal distribution, then Let the cumulative distribution function of the normal density be denoted by Φ(z). In the previous example, Φ(-1.28) = 0.10 and Φ(0.00) = 0.50. Using the quantile "(#q)= q. notation, if ξq is the qth quantile of a normal distribution, then evious example, Φ(-1.28) = 0.10 and Φ(0.50) = 0.50. Using the quantile notation, if at is, the probability a normal sample is less than ξq is in fact just q. Φ(ξvalue q) =z q. Consider the first ordered (1) . What might we expect the value of Φ(z (1) ) to ? Intuitively, we expect this probability to take on a value in the interval (0, 1/n). That is, the probability a normal sample is less than ξ is in fact just ewise, we expect Φ(z (2) ) to take on a value in the interval (1/n, 2/n). qContinuing, we q. Consider the first ordered value, z(1). What might we expect the value of pect Φ(z (n) ) to fall in the interval ((n - 1)/n, 1/n). Thus the theoretical quantile we Φ(z(1)) to be? Intuitively, we expect this probability to take on a value in the sire is defined by the inverse (not reciprocal) of the normal CDF. In particular, the interval (0, 1/n). Likewise, we expect Φ(z(2)) to take on a value in the interval (1/n, om/2/advanced_graphs/q-q_plots.html Page 8 ofThus, 13 2/n). Continuing, we expect Φ(z(n)) to fall in the interval ((n - 1)/n, 1/n). the theoretical quantile we desire is defined by the inverse (not reciprocal) of the normal CDF. In particular, the theoretical quantile corresponding to the empirical quantile z(i) should be 283 he interval (1/n, 2/n). Continuing, we expect (z(n) ) to fall in the interval n 1)/n, 1/n). Thus the theoretical quantile we desire is defined by the verse (not reciprocal) of the normal CDF. In particular, the theoretical uantile corresponding to the empirical quantile z(i) should be ✓ ◆ i 0.5 1 for i = 1, 2, . . . , n . n he empirical CDF and theoretical quantile construction for the small sample ve in TableThe 2 are displayed in Figure 10.quantile For the larger sample of size 100, empirical CDF and theoretical construction for the small sample he first few given expected quantiles are -2.576, -2.170, -1.960. in Table 2 are displayed in Figure 10. Forand the larger sample of size 100, the first few expected quantiles 1are -2.576, -2.170, and1 -1.960. 1 Theoretical CDF 1.8 Probability Probability Theoretical CDF .8.6 1.8 9 Empirical CDF 1.8 .8.6 Empirical CDF .8.6 ● ● ● ● ● .6.4 .6.4 .6.4 .4.2 .4.2 .4.2 .2 0 .2 0 .2 0 ● ● ● ● −2 −1 0 z 0 −2 −1 1 2 ● 0 0 z 1 2 ● −2 ● −1 ● ● −2 ● 0 z● −1 ● 1 2 ● 0 z ● ● −2 0 ● 1 ● ● 2 ● −1 −2 ● ● −1 0 z ● ● 1 ● 0 z 2 ● 1 2 Figure 10: The empirical CDF of a small sample of 5 normal points, together Figure 10. The empirical of a small(red sample of 5the normal points, together with the valuesCDF ofCDF the dots right frame). Figure 10:expected The empirical of 5a points small sample of 5innormal points, together the expected the 5(red points dotsright in theframe). right frame). with the with expected values ofvalues the 5 of points dots(red in the In the left frame of Figure 11, we display the q-q plot of small normal In the left frame of Figure 11, we display the q-q plot of the small normal sample sample in Table 2. The11, remaining frames in Figure display the In thegiven left frame of Figure we display the q-q plot of 11 small normal given in Table 2. The remaining frames in Figure 11 display the q-q plots q-q plots of normal random samples of sizeframes n = 100 and n = Asthe theof normal sample given in Table 2. The remaining in Figure 111000. display random samples ofrandom size =samples 100 in and n =q-q1000. As sample sizeline increases, sample increases, thenpoints the liethe closer the = x the q-q plotssize of normal of size nplots = 100 and n to = 1000. Asy the in the q-q closerintothe theq-q lineplots y = x. aspoints the sample size plots n increases. sample size increases, theliepoints lie closer to the line y = x as the sample size n increases. ● 3 ● ● ●● ●●● ● n=5 32 n = 100 Sample quantile Sample quantile ● 21 ● n=5 ● ●● 0−1 ● ● −1−2 ● ● −2−3 ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● n = 100 ● ● 10 n = 1000 ● ● ●●● ●● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ●● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● n = 1000 ● ●● ●●● ● ● −3 −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 Theoretical quantile Figure 11. q-q plots of normal data. ● ● ● 3 −3 −2 −1 0 1 2 3 ● Theoretical quantile −3 ● ●● ●● ● 3 −3 −2 −1 0 1 2 3 Figure 11: q-q plots of normal data. As before, a normal q-q plot canplots indicate departures Figure 11: q-q of normal data.from normality. The two most common examples areq-q skewed dataindicate and data with heavy tailsnormality. (large kurtosis). As before, a normal plot can departures from The In twoAsmost common examples arecan skewed datadepartures and data with tails (large before, a normal q-q plot indicate fromheavy normality. The kurtosis). In Figure 12 we show normaldata q-q and plotsdata for with a chi-squared (skewed) two most common examples are skewed heavy tails (large data set and a Student’s-t (kurtotic) set, both size n = 1000. The kurtosis). In Figure 12 we show normaldata q-q plots for aofchi-squared (skewed) dataset were first The red data line isset, again y =ofx.size Notice particular data and a standardized. Student’s-t (kurtotic) both n = in 1000. The thatwere the data from the t distribution follow the ynormal curve in fairly closely data first standardized. The red line is again = x. Notice particular 284 4 Figure 12 we show normal q-q plots for a chi-squared (skewed) data set and a Student’s-t (kurtotic) data set, both of size n = 1000. The data were first standardized. The red line is again y = x. Notice, in particular, that the data from the t distribution follow the normal curve fairly closely until the last dozen or so points on each extreme. ● ●●● 3 4 −2 Sample quantile Sample quantile −1 ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● χ (8) ● 0 0 −2 −1 −4 −3 ● ● −4 ● −3 −3 −2 ● −2 −3 ● t5 2 0 ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● 2 1 ● t5 4 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● 2 1 ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● 2 3 2 0 χ2(8) ● −2 −2 −1 0 −1 0 1 1 Figure 12. q-q plots for 2 3 4 Theoretical quantile −4 −2 0 2 4 3 4 −4 −2 Theoretical quantile standardized non-normal data (n = 0 ● 2 2 4 1000). Figure 12: q-q plots for standardized non-normal data (n = 1000). q-q plots for normal data with general mean and scale Figure 12:plots q-q plots standardized non-normal data (n = 1000). 5 q-q for for normal data with general mean 5 Our previous discussion of q-q plots for normal data all assumed that our data were and scale standardized. One approach to constructing q-q plots is to first standardize the data proceed as described previously. AnThe alternative constructofthe plot Whyand didthen we standardize the data in Figure 12? q-q plot isis to comprised directly the n pointsfrom raw data. ✓ section ✓ we present ◆ In this a◆general approach for data that are not i 0.5 1 standardized. Why did we standardize the in2,Figure , z(i) for data i = 1, . . . , n 12? . The q-q plot is n didcomprised we standardize the data in Figure 12? The q-q plot is comprised of the n points q-q plots for normal data with general mean and scale Why the original data {zi } are normal, but have an arbitrary mean µ and stanthe n Ifpoints dard deviation ✓ , then ✓ the line y◆= x will◆not match the expected theoretical quantile. Clearly the transformation i 0.5 1 linear , z(i) for i = 1, 2, . . . , n . n µ + · ⇠q of th provide the {z q } theoretical quantile on the transformed scale. In pracIf the would original data i are normal, but have an arbitrary mean µ and stantice, with a new data set dard deviation , then line y = xbutwill not match the theoretical If the original data the {zi} are have an arbitrary meanexpected μ and standard {xnormal, 1 , x2 , . . . , x n } , deviation then linear the line transformation y = x will not match the expected theoretical quantile. quantile. Clearlyσ,the the normal q-q plot would consist of the n points Clearly, the✓linear✓transformation ◆ ◆ 1 i 0.5 n · ⇠iq= 1, 2, . . . , n . , x(i)µ + for 285 wouldInstead provide the q ththetheoretical scale. In pracof plotting line y = x as quantile a referenceon line,the the transformed line tice, with a new data set y = x̄ + s · x µ + σ ξq would provide the qth theoretical quantile on the transformed scale. In practice, with a new data set {x1,x2,...,xn} , the normal q-q plot would consist of the n points Instead of plotting the line y = x as a reference line, the line y = M + s · x should be composed, where M and s are the sample moments (mean and standard deviation) corresponding to the theoretical moments μ and σ. Alternatively, if the data are standardized, then the line y = x would be appropriate, since now the sample mean would be 0 and the sample standard deviation would be 1. Example: SAT Case Study The SAT case study followed the academic achievements of 105 college students majoring in computer science. The first variable is their verbal SAT score and the second is their grade point average (GPA) at the university level. Before we compute inferential statistics using these variables, we should check if their distributions are normal. In Figure 13, we display the q-q plots of the verbal SAT and university GPA variables. 286 academic achievements of 105 college students. The first variable is their verbal SAT score and the second is their grade point average (GPA) at the university level. Before we compute a correlation coefficient, we should check if the distributions of the two variables are both normal. In Figure 13, we display the q-q plots of the verbal SAT and university GPA variables. ● ● Verbal SAT 700 ● ● ● 3.75 ● ●● ● ●●● ●●●●● ●● GPA 3.50 ● Sample quantile 650 3.25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 600 ● ● ● 2.75 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●● ● ● ● ● 2.50 ●● ●● ●● ●●● ●●● ● ● 500 ● ● ● ● ● ● 3.00 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 550 ●● ● ●●● ● ●●● ●●●● ● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● 2.25 ● ● ●● ●●●● ●●● ●●● ● ● ● ● ● −2 −1 0 1 2 Theoretical quantile −2 −1 0 1 2 Figure 13. q-q plots for the student data (n = 105). Figure 13: q-q plots for the student data (n = 105). The verbal reasonably well, except in The verbal SAT SAT seems seems to tofollow followaanormal normaldistribution distribution reasonably well, the in extreme tails. However, the university GPA variable is highlyisnon-normal. except the extreme tails. However, the university GPA variable highly Compare the GPA q-q plot to the simulation in the right frame of Figure non-normal. Compare the GPA q-q plot to the simulation in the right frame7. These figures 7.are These very similar, for the regionexcept wherefor x ≈the -1. To follow these ideas, of Figure figuresexcept are very similar, region where x ⇡we1.computed To followhistograms these ideas, of thediagram variables ofwe thecomputed variables histograms and their scatter inand Figure 14. theirThese scatter diagram in Figure 14. These tell quite GPA a di↵erent story.with about figures tell quite a different story.figures The university is bimodal, The20% university GPA is bimodal, with about 20% of the students into scatter of the students falling into a separate cluster with a gradefalling of C. The a separate cluster with a grade of C. The scatter diagram is quite unusual. diagram is quite unusual. While the students in this cluster all have below average While the students in this cluster all have below average verbal SAT scores, verbal SAT scores, there are as low many students lowGPA’s SAT scores whose GPAs there are as many students with SAT scoreswith whose were quite were quite We might as todi↵erent the cause(s): different respectable. Werespectable. might speculate as to speculate the cause(s): majors, di↵erent distractions, different study but it wouldBut onlyobserve be speculation. study habits, but it would onlyhabits, be speculation. that the But raw observe correlation verbal SAT and verbal GPA isSAT a rather high is 0.65, but when we but that the between raw correlation between and GPA a rather high 0.65, when we exclude the cluster, the correlation for the remaining 86 students falls a 12 little to 0.59. 287 exclude the cluster, the correlation for the remaining 86 students falls a little to 0.59. 30 25 25 3.5 ● 20 15 15 10 10 5 5 0 0 gpa Frequency Frequency 20 3.0 2.5 ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ●● ● ●● ● ● ● ●● ● ● ● ● ●● ● 450 500 550 600 650 700 verbal 750 ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ●● ● ● ●● ● ● ● ● ● ●●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● 2.0 2.5 3.0 3.5 4.0 gpa 500 550 600 650 700 verbal Figure 14. Histograms and scatter diagram of the verbal SAT and GPA Figure 14: Histograms and scatter diagram of the verbal SAT and GPA for the 105 students. variablesvariables for the 105 students. Discussion 6Parametric Discussion modeling usually involves making assumptions about the shape of data, or the shape of residuals from a regression fit.assumptions Verifying such assumptions Parametric modeling usually involves making about the shape can take but anofexploration of the shape using and assumpq-q plots is very ofmany data, forms, or the shape residuals from a regression fit.histograms Verifying such tions can take forms, but exploration of the shape using effective. Themany q-q plot does notanhave any design parameters suchhistograms as the number of and are very e↵ective. The q-q plot does not have any design pabinsq-q forplots a histogram. rametersInsuch as the number of bins histogram. an advanced treatment, thefor q-qa plot can be used to formally test the null In an advanced course, the q-q plot can be used to formally test the hypothesis that the data are normal. This is done by computing the correlation null hypothesis that the data are normal. This is done by computing the coefficientcoefficient of the n points q-qin plot. upon n, the nulln,hypothesis is correlation of theinn the points theDepending q-q plot. Depending upon the rejected if the iscorrelation is less than a threshold. The threshold is null hypothesis rejected if coefficient the correlation coefficient is less than a threshold. already quite is close to 0.95 forclose modest sample The threshold already quite to 0.95 for sizes. modest sample sizes. We have seenseen thatthat the the q-q q-q plotplot for for uniform data is very close related to to the We have uniform data is very closely related the empirical cumulativedistribution distribution function.For For generaldensity densityfunctions, functions,the soempirical cumulative function. general the so-called probability integral transform a random variable andit to the called probability integral transform takes atakes random variable X and X maps maps it to the interval (0, 1) through the CDF of X itself, that is, interval (0, 1) through the CDF of X itself, that is, Y = FX(X) Y = FX (X) which has been shown to be a uniform density. This explains why the q-q plot onwhich standardized is to always close to density. the lineThis y = explains x when why the model has been data shown be a uniform the q-qisplot on correct. standardized data is always close to the line y = x when the model is correct. Finally, scientists have used special graph paper for years to make relationships 13 linear (straight lines). The most common example used to be semi-log paper, on which points following the formula y = aebx appear linear. This follows of course since log(y) = log(a) + bx, which is the equation for a straight line. The q-q plots 288 may be thought of as being “probability graph paper” that makes a plot of the ordered data values into a straight line. Every density has its own special probability graph paper. 289 Contour Plots by David Lane Prerequisites • none Learning Objectives 1. Describe a contour plot. 2. Interpret a contour plot Contour plots portray data for three variables in two dimensions. The plot contains a number of contour lines. Each contour line is shown in an X-Y plot and has a constant value on a third variable. Consider the Figure 1 that contains data on the fat, non-sugar carbohydrates, and calories present in a variety of breakfast cereals. Each line shows the carbohydrate and fat levels for cereals with the same number of calories. Note that the number of calories is not determined exactly by the fat and non-sugar carbohydrates since cereals also differ in sugar and protein. 290 10 9 8 200 7 175 Fat 6 5 4 125 3 100 2 75 1 150 150 0 10 20 30 40 50 Carbohydrates Figure 1. A contour plot showing calories as a function of fat and carbohydrates. An alternative way to draw the plot is shown in Figure 2. The areas with the same number of calories are shaded. 291 cereal contour: Contour Plot of Calories Page 1 of 2 Contour Plot for Calories 10 9 8 200 7 175 Fat 6 5 4 125 3 100 2 75 1 150 150 0 10 20 30 40 50 Carbohydrates Figure 2. A contour plot showing calories as a function of fat and carbohydrates with areas shaded. An area represents values less than or equal to the label to the right of the area. 292 3D Plots by David Lane Prerequisites • Chapter 4: Introduction to Bivariate Data Learning Objectives 1. Describe a 3D Plot. 2. Give an example of the value of a 3D plot. Just as two-dimensional scatter plots show the data in two dimensions, 3D plots show data in three dimensions. Figure 1 shows a 3D scatter plot of the fat, nonsugar carbohydrates, and calories from a variety of cereal types. Figure 1. A 3D scatter plot showing fat, non-sugar carbohydrates, and calories from a variety of cereal types. 293 Many statistical packages allow you to rotate the axes interactively to view the data from a different vantage point. Figure 2 is an example. Figure 2. An alternative 3D scatter plot showing fat, non-sugar carbohydrates, and calories. A fourth dimension can be represented as long as it is represented as a nominal variable. Figure 3 represents the different manufacturers by using different colors. 294 Figure 3. The different manufacturers are color coded. Interactively rotating 3D plots can sometimes reveal aspects of the data not otherwise apparent. Figure 4 shows data from a pseudo random number generator. Figure 4 does not show anything systematic and the random number generator appears to generate data with properties similar to those of true random numbers. 295 Figure 4. A 3D scatter plot showing 400 values of X, Y, and Z from a pseudo random number generator. Figure 5 shows a different perspective on these data. Clearly they were not generated by a random process. 296 Figure 5. A different perspective on the 3D scatter plot showing 400 values of X, Y, and Z from a pseudo random number generator. Figures 4 and 5 are reproduced with permission from R snippets by Bogumil Kaminski. 297 Statistical Literacy by David M. Lane Prerequisites • Chapter 8: Contour Plots This web page portrays altitudes in the United States. What do you think? What part of the state of Texas (North, South, East, or West) contains the highest elevation? West Texas 298 Exercises 1. What are Q-Q plots useful for? 2. For the following data, plot the theoretically expected z score as a function of the actual z score (a Q-Q plot). 0 0.5 0.8 1.3 2.1 0 0.6 0.9 1.4 2.1 0 0.6 1 1.4 2.1 0 0.6 1 1.5 2.1 0 0.6 1.1 1.6 2.1 0 0.6 1.1 1.7 2.1 0.1 0.6 1.2 1.7 2.3 0.1 0.6 1.2 1.7 2.5 0.1 0.6 1.2 1.8 2.7 0.1 0.6 1.2 1.8 3 0.1 0.7 1.2 1.9 4.2 0.2 0.7 1.2 1.9 5 0.2 0.8 1.3 2 5.7 0.3 0.8 1.3 2 12.4 0.3 0.8 1.3 2 15.2 0.4 0.8 1.3 2.1 3. For the data in problem 2, describe how the data differ from a normal distribution. 4. For the “SAT and College GPA” case study data, create a contour plot looking at College GPA as a function of Math SAT and High School GPA. Naturally, you should use a computer to do this. 5. For the “SAT and College GPA” case study data, create a 3D plot using the variables College GPA, Math SAT, and High School GPA. Naturally, you should use a computer to do this. 299