Part One
Exploratory Data Analysis and Probability Distributions

Charles A. Rohde
Fall 2001

Contents

1 Numeracy and Exploratory Data Analysis
  1.1 Numeracy
    1.1.1 Numeracy
  1.2 Discrete Data
  1.3 Stem and leaf displays
  1.4 Letter Values
  1.5 Five Point Summaries and Box Plots
  1.6 EDA Example
  1.7 Other Summaries
    1.7.1 Classical Summaries
  1.8 Transformations for Symmetry
  1.9 Bar Plots and Histograms
    1.9.1 Bar Plots
    1.9.2 Histograms
    1.9.3 Frequency Polygons
  1.10 Sample Distribution Functions
  1.11 Smoothing
    1.11.1 Smoothing Example
  1.12 Shapes of Batches
  1.13 References

2 Probability
  2.1 Mathematical Preliminaries
    2.1.1 Sets
    2.1.2 Counting
  2.2 Relating Probability to Responses and Populations
  2.3 Probability and Odds - Basic Definitions
    2.3.1 Probability
    2.3.2 Properties of Probability
    2.3.3 Methods for Obtaining Probability Models
    2.3.4 Odds
  2.4 Interpretations of Probability
    2.4.1 Equally Likely Interpretation
    2.4.2 Relative Frequency Interpretation
    2.4.3 Subjective Probability Interpretation
    2.4.4 Does it Matter?
  2.5 Conditional Probability
    2.5.1 Multiplication Rule
    2.5.2 Law of Total Probability
  2.6 Bayes Theorem
  2.7 Independence
  2.8 Bernoulli trial models; the binomial distribution
  2.9 Parameters and Random Sampling
  2.10 Probability Examples
    2.10.1 Randomized Response
    2.10.2 Screening

3 Probability Distributions
  3.1 Random Variables and Distributions
    3.1.1 Introduction
    3.1.2 Discrete Random Variables
    3.1.3 Continuous or Numeric Random Variables
    3.1.4 Distribution Functions
    3.1.5 Functions of Random Variables
    3.1.6 Other Distributions
  3.2 Parameters of Distributions
    3.2.1 Expected Values
    3.2.2 Variances
    3.2.3 Quantiles
    3.2.4 Other Expected Values
    3.2.5 Inequalities involving Expectations

4 Joint Probability Distributions
  4.1 General Case
    4.1.1 Marginal Distributions
    4.1.2 Conditional Distributions
    4.1.3 Properties of Marginal and Conditional Distributions
    4.1.4 Independence and Random Sampling
  4.2 The Multinomial Distribution
  4.3 The Multivariate Normal Distribution
  4.4 Parameters of Joint Distributions
    4.4.1 Means, Variances, Covariances and Correlation
    4.4.2 Joint Moment Generating Functions
  4.5 Functions of Jointly Distributed Random Variables
    4.5.1 Linear Combinations of Random Variables
  4.6 Approximate Means and Variances
  4.7 Sampling Distributions of Statistics
  4.8 Methods of Obtaining Sampling Distributions or Approximations
    4.8.1 Exact Sampling Distributions
    4.8.2 Asymptotic Distributions
    4.8.3 Central Limit Theorem
    4.8.4 Central Limit Theorem Example
    4.8.5 Law of Large Numbers
    4.8.6 The Delta Method - Univariate
    4.8.7 The Delta Method - Multivariate
    4.8.8 Computer Intensive Methods
    4.8.9 Bootstrap Example
Chapter 1

Numeracy and Exploratory Data Analysis

1.1 Numeracy

1.1.1 Numeracy

Since most of statistics involves the use of numerical data to draw conclusions, we first discuss the presentation of numerical data. Numeracy may be broadly defined as the ability to think effectively about numbers and to present them effectively.

• One of the most common forms of presentation of numerical information is in tables.

• There are some simple guidelines which allow us to improve the tabular presentation of numbers.

• In certain situations the guidelines presented here will need to be modified, e.g. if the audience (readers of a professional journal, say) expects the results to be presented in a specified format.

Guidelines

• Round to two significant figures.
  ◦ A table of numbers is almost always easier to understand if the numbers do not contain too many significant figures.

• Add averages or totals.
  ◦ Adding row and/or column averages, proportions or totals to a table, when appropriate, often provides a useful focus for establishing trends or patterns.

• Numbers are easier to compare in columns.

• Order by size.
  ◦ A more effective presentation is often achieved by rearranging so that the largest (and presumably most important) numbers appear first.

• Spacing and layout.
  ◦ It is useful to present tables in single-spaced format and not have a lot of "empty space" to distract the reader from concentrating on the numbers in the table.

1.2 Discrete Data

For discrete data, present tables of the numbers of responses at the various values, possibly grouped by factors. One can also produce bar graphs and histograms for graphical presentation. Thus in the first example in the Introduction we might present the results as follows:

            Proportion   Cases Studied
  Placebo     .008         200,745
  Vaccine     .004         201,229

A sensible description might be: 4 cases per thousand for the vaccinated group and 8 cases per thousand for the placebo group.

For the alcohol use data in the Overview section, e.g.

  Group        Use Alcohol   Surveyed   Proportion
  Clergy           32           300        .11
  Educators        51           250        .20
  Executives       67           300        .22
  Merchants        83           350        .24

we might present the data as Figure 1.1.

For the self classification data in the Overview section, e.g.

  Class     Lower   Working   Middle   Upper
  Number      72      714       655      41

we might present the data as Figure 1.2.

1.3 Stem and leaf displays

Suppose we have a batch or collection of numbers. Stem and leaf displays provide a simple yet informative way to

• Develop summaries or descriptions of the batch, either to learn about it in isolation or to compare it with other batches. The fundamental summaries are
  ◦ location of the batch (a center concept)
  ◦ scale or spread of the batch (a variability concept).

• Explore (note) characteristics of the batch including
  ◦ symmetry and general shape
  ◦ exceptional values
  ◦ gaps
  ◦ concentrations

Consider the following batch of 62 numbers which give the ages in years of the graduate students, post-docs, staff and faculty of a large academic department of statistics:

  33 33 34 37 20 22 23 28 41 42 43 44 52 55 59 64
  35 36 37 37 25 26 27 29 43 43 43 44 61 61 61 64
  37 37 39 40 29 30 31 31 44 46 46 49 64 65 67 74
  40 40 41 51 32 50 76 32 50 79 32 51 81 52

Not much can be learned by looking at the numbers in this form.
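(A stem and leaf display of this batch can also be produced with a few lines of code. The sketch below is illustrative Python and is not part of the original notes, which use Stata's stem command for the same task.)

```python
# Illustrative sketch: a bare-bones stem and leaf display of the 62 ages.
from collections import defaultdict

ages = [33, 33, 34, 37, 20, 22, 23, 28, 41, 42, 43, 44, 52, 55, 59, 64,
        35, 36, 37, 37, 25, 26, 27, 29, 43, 43, 43, 44, 61, 61, 61, 64,
        37, 37, 39, 40, 29, 30, 31, 31, 44, 46, 46, 49, 64, 65, 67, 74,
        40, 40, 41, 51, 32, 50, 76, 32, 50, 79, 32, 51, 81, 52]

stems = defaultdict(list)
for x in sorted(ages):
    stems[x // 10].append(x % 10)       # stem = tens digit, leaf = units digit

for s in sorted(stems, reverse=True):   # largest stems at the top, as in the notes
    print(f"{s} | {''.join(str(leaf) for leaf in stems[s])}")
```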
A simple display which begins to describe this collection of numbers is as follows: ( 1) 1 ( 4) 3 (12) 8 (20) 8 (42) 16 (26) 17 ( 9) 9 9 8 7 6 5 4 3 2 1 | | | | | | | | | | 1 4 1 9 2 0 9 6 4 1 1 7 7 9 5 5 4 6 3 7 2 3 3 2 4 1 3 7 9 1 2 3 2 0 1 0 0 7 5 4 0 3 6 0 1 6 4 0 9 4 2 2 2 1 5 9 4 7 1 7 6 8 Interpretation: 1 at 8 means 81, 4 at 7 means 74, 6 at 7 means 76, 9 at 7 means 79, etc. 8 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS A more refined version of this display is: ( 1) 1 ( 4) 3 (12) 8 (20) 8 (42) 16 (26) 17 ( 9) 9 9 8 7 6 5 4 3 2 1 | | | | | | | | | 1 4 1 0 0 0 0 6 1 0 0 1 2 9 1 1 0 1 3 4 1 1 2 5 4 2 1 2 6 4 2 2 2 7 5 5 3 3 8 7 9 3 3 3 4 4 4 6 6 9 3 4 5 6 7 7 7 7 7 9 9 9 Interpretation: 1 at 8 means 81, 4 at 7 means 74, 6 at 7 means 76, 9 at 7 means 79, etc. To construct a stem and leaf display we perform the following steps: • To the left of the solid line we put the stem of the number • To the right of the solid line we put the leaf of the number. The remaining entries in the display are discussed in the next section. Note that a stem and leaf display provides a quick and easy way to display a batch of numbers. Every statistical package now has a program to draw stem and leaf displays. Some additional comments on stem and leaf displays: • Number of stems. Understanding Robust and Exploratory Data Analysis suggests for n less than 100 and 10 log10 (n) for n larger than 100. √ n (Usually more than 50 are done using a computer and each statistical package has its own default method). • Stems can be double (or more) digits and there can be stems such as 5? and 5· which divide the numbers with stem 5 into two groups (0,1,2,3,4) and (5,6,7,8,9). Large displays could use 5 or 10 divisions per stem. The important idea is to display the numbers effectively. • For small batches, when working by hand, the use of stem and leaf displays is a simple way to obtain the ordered values of the batch. 1.4. LETTER VALUES 1.4 9 Letter Values The stem and leaf display can be used to determine a collection of derived numbers, called statistics, which can be used to summarize some additional features of the batch. To do this we need determine the total size of the batch and where the individual numbers are located in the display. • To the left of the stem we count the number of leaves on each stem. • The numbers in parentheses are the cumulative numbers counting up and counting down. • Using the stem and leaf display we can easily “count in” from either end of the batch. ◦ The associated count is called the depth of the number. ◦ Thus at depth 4 we have the number 74 if we count down (largest to smallest) and the number 25 if we count up (smallest to largest). • It is easier to understand the concept of depth if the numbers are written in a column from largest to smallest. • A measure of location is provided by the median, defined as that number in the display with depth equal to 1 (1 + batch size) 2 ◦ If the size of the batch is even (n = 2m) the depth of the median will not be an integer. ◦ In such a case the median is defined to be halfway between the numbers with depth m and depth m + 1. ◦ In the example 1 63 median depth = (1 + 62) = = 31.5 2 2 thus the median is given by: 41 + 42 (# with depth 31) + (# with depth 32) = = 41.5 2 2 10 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS ◦ The median has the property that 1/2 of the numbers in the batch are above it and 1/2 of the numbers in the batch are below it, i.e., it is halfway from either end of the batch. 
• The median is just one example of a letter value. Other letter values enable us to describe variability, shape and other characteristics of the batch. ◦ The simplest sequence of letter values divides the lower half in two and the upper half in two, each of these halves in two, and so on. ◦ To obtain these letter values we first find their depths by the formula next letter value depth = 1 (1 + [previous letter value depth]) 2 where [ ] means we discard any fraction in the calculation. (Called the “floor function”). ◦ Thus the upper and lower quartiles have depths equal to 1 (1 + [depth of median]) 2 The quartiles are sometimes called fourths. ◦ The eighths have depths equal to 1 (1 + [depth of hinge]) 2 ◦ We proceed down to the extremes which have depth 1. ◦ The median, quartiles and extremes often describe a batch of numbers quite well. ◦ The remaining letter values are used to describe more subtle features of the data (illustrated later). In the example we thus have 1 32 (1 + 31) = = 16 2 2 1 17 E depth = (1 + 16) = = 8.5 2 2 1 2 Extreme depth = (1 + 1) = = 1 2 2 The corresponding letter values are F depth = 1.4. LETTER VALUES 11 M 41.5 F 33 Ex 20 52 81 depth 31.5 depth 16 depth 1 We can display the letter values as follows: Value Depth M 31.5 F 16 E 8.5 Ex 1 Lower Upper 41.5 41.5 33 52 29 64 20 81 Spread 0 19 35 61 where the spread of a letter value is defined as: upper letter value − lower letter value 12 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS 1.5 Five Point Summaries and Box Plots • A useful summary of a batch of numbers is the five point summary in which we list the upper and lower extremes, the upper and lower hinges and the median. Thus for the example we have the five point summary given by 20, 33, 41.5, 52, 81 • A five point summary can be displayed graphically as a box plot in which we picture only the median, the lower fourth, the upper fourth and the extremes as on the following page: 1.5. FIVE POINT SUMMARIES AND BOX PLOTS 13 For this batch of numbers there is evidence of asymmetry or skewness as can be observed from the stem-leaf display or the box plot. Figure 1.3: To measure spread we can use the interquartile range which is simply the diference between the upper quartile and the lower quartile. 14 1.6 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS EDA Example The following are the heights in centimeters of 351 elderly female patients. The data set is elderly.raw (from Hand et. al. 
pages 120-121) 156 150 156 155 164 160 156 153 159 157 167 166 151 157 155 160 158 164 157 169 154 157 169 155 158 150 163 158 161 160 163 164 156 161 163 162 151 162 159 163 162 159 171 158 161 161 164 145 158 163 158 167 154 168 167 165 166 155 155 152 169 159 153 158 164 155 165 163 158 166 153 157 162 153 156 167 163 153 168 164 162 142 155 152 164 165 162 168 158 153 161 157 178 163 157 160 169 162 160 165 156 152 158 155 153 162 155 169 161 150 164 166 167 165 170 147 163 160 161 154 166 161 158 152 151 157 164 165 155 163 159 152 161 156 158 155 160 165 154 158 163 164 158 164 162 160 153 163 156 163 164 162 154 163 152 155 152 151 157 166 157 160 158 163 158 159 167 165 165 163 170 162 166 165 162 163 157 163 153 158 163 173 160 164 155 157 157 147 160 162 160 164 147 165 159 158 158 158 151 174 173 170 158 153 161 156 164 161 158 152 154 165 166 161 149 156 163 157 168 170 160 153 176 163 158 161 156 163 155 154 160 145 168 145 152 156 170 162 173 162 166 160 162 169 160 161 153 155 163 157 155 158 148 161 156 162 153 157 167 148 150 163 161 156 166 159 160 159 163 178 165 156 154 170 161 159 155 153 158 159 155 171 160 171 160 157 170 158 168 164 160 166 165 177 170 150 154 163 153 163 169 146 158 153 156 155 159 157 156 150 158 163 163 164 159 159 159 151 161 165 154 159 158 157 162 155 165 160 158 159 156 165 155 152 161 169 156 161 154 158 163 170 165 152 170 152 153 157 156 147 170 1.6. EDA EXAMPLE STATA log for EDA of Heights of Elderly Women . infile height using c:\courses\b651201\datasets\elderly.raw (351 observations read) . stem height Stem-and-leaf plot for height 14t 14f 14s 14. 15* 15t 15f 15s 15. 16* 16t 16f 16s 16. 17* 17t 17f 17s 17. | | | | | | | | | | | | | | | | | | | 2 555 67777 889 000000111111 22222222222233333333333333333 44444444444555555555555555555555 6666666666666666666677777777777777777777 888888888888888888888888888888899999999999999999 00000000000000000000011111111111111111111 222222222222222222333333333333333333333333333333 44444444444444444555555555555555555 666666666667777777 88888899999999 00000000000111 333 4 67 88 15 16 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS . summarize height, detail height ------------------------------------------------------------Percentiles Smallest 1% 145 142 5% 150 145 10% 152 145 Obs 351 25% 156 145 Sum of Wgt. 351 50% 75% 90% 95% 99% 160 164 168 170 176 Largest 176 177 178 178 Mean Std. Dev. 159.7749 6.02974 Variance Skewness Kurtosis 36.35777 .1289375 3.160595 . display 3.49*6.02974*(351^(-1/3)) 2.9832408 . display 3.49*sqrt(r(Var))*(351^(-1/3)) 2.983241 . display (178-142)/2.98 12.080537 . display min(sqrt(351),10*log(10)) 18.734994 1.6. EDA EXAMPLE 17 . graph height, normal xlabel ylabel ti(Heights of Elderly Women 5 Bins) . graph height, normal xlabel ylabel ti(Heights of Elderly Women 5 Bins) saving > (g1,replace) . graph height, bin(12) normal xlabel ylabel ti(Heights of Elderly Women 5 Bins > ) saving(g2,replace) . graph height, bin(12) normal xlabel ylabel ti(Heights of Elderly Women 12 Bin > s) saving(g2,replace) . graph height, bin(18) normal xlabel ylabel ti(Heights of Elderly Women 18 Bin > s) saving(g3,replace) . graph height, bin(25) normal xlabel ylabel ti(Heights of Elderly Women 25 Bin > s) saving(g4,replace) . graph using g1 g2 g3 g4 18 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS Histograms of Data on Elderly Women Figure 1.4: Histograms 1.6. EDA EXAMPLE 19 . 
lv height # 351 M F E D C B A Z Y 176 88.5 44.5 22.5 11.5 6 3.5 2 1.5 1 inner fence outer fence height --------------------------------| 160 | | 156 160 164 | | 153 159.5 166 | | 151 160.25 169.5 | | 148.5 159.5 170.5 | | 147 160 173 | | 145 160.75 176.5 | | 145 161.5 178 | | 143.5 160.75 178 | | 142 160 178 | | | | 144 176 | | 132 188 | spread 8 13 18.5 22 26 31.5 33 34.5 36 # below 1 0 pseudosigma 5.95675 5.667454 6.048453 5.929273 6.071367 6.659417 6.360923 6.355203 6.246375 # above 4 0 spread 8.00 13.00 18.50 22.00 26.00 31.50 33.00 34.50 36.00 # below 1 0 pseudosigma 5.96 5.67 6.05 5.93 6.07 6.66 6.36 6.36 6.25 # above 4 0 . format height %9.2f . lv height # 351 M F E D C B A Z Y 176 88.5 44.5 22.5 11.5 6 3.5 2 1.5 1 inner fence outer fence height --------------------------------| 160.00 | | 156.00 160.00 164.00 | | 153.00 159.50 166.00 | | 151.00 160.25 169.50 | | 148.50 159.50 170.50 | | 147.00 160.00 173.00 | | 145.00 160.75 176.50 | | 145.00 161.50 178.00 | | 143.50 160.75 178.00 | | 142.00 160.00 178.00 | | | | 144.00 176.00 | | 132.00 188.00 | 20 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS . graph height, box . graph height, box ylabel . graph height, box ylabel l1(Height in Centimeters) ti(Box Plot of Heights of > Elderly Women) . cumul height, gen(cum) . graph cum height,s(i) c(l) ylabel xlabel ti(Empirical Distribution Function O > f Heights of Elderly Women) rlabel yline(.25,.5,.75) . kdensity height . kdensity height,normal ti(Kdensity Estimate of Heights) . log close 1.7. OTHER SUMMARIES 1.7 21 Other Summaries Other measures of location are • mid = 12 (UQ + LQ) + UQ where UQ is the upper quartile, M • tri-mean = 12 (mid + median) = LQ + 2M 4 is the median and LQ is the lower quartile. It is often useful to identify exceptional values that need special attention. We do this using fences. • The upper and lower fences are defined by upper fence = UF lower fence = LF = upper hinge + 32 (H-spread) = lower hinge − 32 (H-spread) • Values above the upper fence or below the lower fence can be considered as exceptional values and need to be examined closely for validity. 22 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS 1.7.1 Classical Summaries The summary quantities developed in the previous sections are examples of statistics, formally defined as functions of a sample data set. There are other summary measures of a sample data set. • For location, the traditional summary measure is the sample mean defined by x̄ = n 1X xi n i=1 where n is the number of observations in the data set and (x1 , x2 , . . . , xn ) is the sample data set. • For spread or variablity the sample variance, s2 , and the sample standard deviation, s, are defined by n √ 1 X s2 = (xi − x̄)2 and s = s2 n − 1 i=1 • Note that where x̄i−1 µ ¶ 1 1 x̄(i−1) + xi x̄ = 1 − n n is the sample mean of the data set with the ith observation removed. ◦ It follows that a single observation can greatly influence the magnitude of the sample mean which explains why other summaries such as the median or tri-mean for location are often used. ◦ Similarly the sample variance and sample standard deviation are greatly influenced by single observations. • For distributions which are “bell-shaped” the interquartile range is approximately equal to 1.34 s to where s is the sample standard deviation. 1.8. TRANSFORMATIONS FOR SYMMETRY 1.8 23 Transformations for Symmetry Data can be easier to understand if it is nearly symmetric and hence we sometimes transform a batch to make it approximately symmetric. 
The reasons for transformations are: • For symmetric batches we have an unambiguous measure of center (the mean or the median). • Transformed data may have a scientific meaning. • Many statistical methods are more reliable for symmetric data. As examples of transformed data with scientific meaning we have • For income and population changes the natural logarithm is often useful since both money and poulations grow exponentially i.e. Nt = N0 exp(rt) where r is the interest rate or growth rate. • In measuring consumption e.g. miles per gallon or BTU per gallon the reciprocal is a measure of power. The fundamental use of transformations is to change shape which can be loosely described as everything about the batch other than location and scale. Desirable features of a transformation is to preserve order and be a simple and smooth function of the data. We first note that a linear transformation does not change shape, it only changes the location and center of the batch since t(yi ) = a + byi , t(yj ) = a + byj =⇒ t(yi ) − t(yj ) = b(yi − yj ) shows that a linear transformation does not change the relative distances between observations. Thus a linear transformation does not change the shape of the batch. To choose a transformation for symmetry we first need to determine whether the data are skewed right or skewed left. A simple way to do this is to examine the “mid-list” defined as lower letter value + upper letter value mid letter value = 2 24 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS If the values in the mid-list increase as the letter values increase then the batch is skewed right. Conversely if the values in the mid-list decrease as the letter values increase the batch is skewed left. A convenient collection of transformations is the power family of transformations defined by ( y k k 6= 0 tk (y) = ln(y) k = 0 For this family of transformations we have the following ladder of re-expression or transformation: k tk (y) 2 y2 1 y √ 1 y 2 0 ln(y) √ 1 − 2 −1/ y −1 −1/y −2 −1/y 2 The rule for using this ladder is to start at the transfomation where k = 1. If the data are skewed to high values, go down the ladder to find a transformation. If skewed towards low values of y go up the ladder. For the data set on ages the complete set of letter vales as produced by STATA is # M F E D C B 62 31.5 16 8.5 4.5 2.5 1.5 1 inner fence outer fence y --------------------------------| 41.5 | | 33 42.5 52 | | 29 46.5 64 | | 25.5 48 70.5 | | 22.5 50 77.5 | | 21 50.5 80 | | 20 50.5 81 | | | | | | 4.5 80.5 | | -24 109 | spread 19 35 45 55 59 61 # below 0 0 # above 1 0 1.8. TRANSFORMATIONS FOR SYMMETRY 25 Thus the mid-list is mid letter value 41.5 median 42.5 fourth 46.5 eighth 48 D 50 B 50.5 A 50.5 Extreme Since the values increase we need to go down the ladder. Hence we try square roots or natural logarithms first. Note: There are some rather sophisticated symmetry plots now available. e.g. STATA has a command symplot which determines the value of k. Often, however this results in k = .48 or k = .52. Try to choose a k which is simple e.g. k = 1/2 and hope for a scientific justification. 26 CHAPTER 1. 
NUMERACY AND EXPLORATORY DATA ANALYSIS Here are the stem and leaf plots of the natural logarithm and square root of the age data 30* | 31* | 32* | 33* | 34* | 35* | 36* | 37* | 38* | 39* | 40* | 41* | 42* | 43* | lnage 09 4 26 0377 033777 00368 111116999 1146666888 339 113355 18 1116667 0 0379 4** | 47 4** | 69 4** | 80 5** | 00,10 5** | 20,29,39,39 5** | 48,57,57 5** | 66,66,66,74,74 5** | 83,92 6** | 00,08,08,08,08,08 6** | 24,32,32,32 6** | 40,40,48,56,56,56,56 6** | 63,63,63,78,78 6** | 7** | 00,07,07,14,14 7** | 21,21 7** | 42 7** | 68 7** | 81,81,81 8** | 00,00,00,06,19 8** | 8** | 8** | 60,72 8** | 89 9** | 00 square root of age 1.9. BAR PLOTS AND HISTOGRAMS 1.9 27 Bar Plots and Histograms Two other useful graphical displays for describing the shape of a batch of data are provided by bar plots and histograms. 1.9.1 Bar Plots • Barplots are very useful for describing relative proportions and frequencies defined for different groups or intervals. • The key concept in constructing bar plots is to remember that the plot must be such that the area of the bar is proportional to the quantity being plotted. • This causes no problems if the intervals are of equal length but presents real problems if the intervals are not of equal length. • Such incorrect graphs are examples of “lying graphics” and must be avoided. 1.9.2 Histograms • Histograms are similar to bar plots and are used to graph the proportion of data set values in specified intervals. • These graphs give insight into the distributional patterns of the data set. • Unlike stem-leaf plots, histograms sacrifice the individual data values. • In constructing histograms the same basic principle used in constructing bar plots applies: the area over an interval must be proportional to the number or proportion of data values in the interval. The total area is often scaled to be one. • Smoothed histograms are available in most software packages. (more later when we discuss distributions). The following pages show the histogram of the first data set of 62 values with equal intervals and the kdensity graph. 28 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS Histogram Figure 1.5: 1.9. BAR PLOTS AND HISTOGRAMS Smoothed histogram Figure 1.6: 29 30 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS 1.9.3 Frequency Polygons • Closely related to histograms are frequency polygons in which the proportion or frequency of an interval is plotted at the mid point of the interval and the resulting points connected. • Frequency polygons are also useful in visualizing the general shape of the distribution of a data set. Here is a small data set giving the number of reported suicide attempts in a major US city in 1971: Age 6-15 16-25 Frequency 4 28 26-35 16 36-45 8 46-55 4 56-65 1 1.9. BAR PLOTS AND HISTOGRAMS The frequency polygon for this data set is as follows: Figure 1.7: 31 32 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS 1.10 Sample Distribution Functions • Another useful graphical display is the sample distribution function or empirical distribution function which is a plot of the proportion of values less than or equal to y versus y where y represents the ordered values of the data set. • These plots can be conveniently made using current software but usually involve too much computation to be done by hand. • They represent a very valuable technique for comparing observed data sets to theoretical models as we will see later. 1.10. SAMPLE DISTRIBUTION FUNCTIONS Here is the sample distribution function for the first data set on ages. 
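Before the figure, here is a minimal sketch of how the plotted proportions can be computed. It is illustrative Python and not part of the original notes (the Stata log for the heights data obtains the same plot with the cumul command); the batch is the 62 ages from Section 1.3.

```python
# Illustrative sketch: sample (empirical) distribution function of the 62 ages.
import numpy as np
import matplotlib.pyplot as plt

ages = np.array([33, 33, 34, 37, 20, 22, 23, 28, 41, 42, 43, 44, 52, 55, 59, 64,
                 35, 36, 37, 37, 25, 26, 27, 29, 43, 43, 43, 44, 61, 61, 61, 64,
                 37, 37, 39, 40, 29, 30, 31, 31, 44, 46, 46, 49, 64, 65, 67, 74,
                 40, 40, 41, 51, 32, 50, 76, 32, 50, 79, 32, 51, 81, 52])

y = np.sort(ages)                         # ordered values of the batch
p = np.arange(1, y.size + 1) / y.size     # proportion of values less than or equal to y

plt.step(y, p, where="post")              # step plot of the sample distribution function
plt.xlabel("age")
plt.ylabel("proportion of ages <= y")
plt.title("Sample distribution function of the age data")
plt.show()
```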
Figure 1.8: 33 34 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS 1.11 Smoothing Time series data of the form yt : t = 0, 1, 2, . . . , n which we abbreviate to {yt } can usefully be separate d into two additive parts: {zt } and {rt } where • {zt } is the smooth or signal and represents that part of the data which is slowly varying and structured. • {rt } is the rough or noise and represents that part of the data which is rapidly varying and unstructured. {zt }, the smooth, tells us about long-run patterns while {rt }, the roughh, tells us about exceptional points. The operator which converts the data {yt } into the smooth is called a data smoother. The smoothed data may then be written as Sm{yt }. The corresponding rough is then given by Ro{yt } = {yt } − Sm{yt } There are many smoothers, defined by their properties. For our purposes two general types are important: • Linear smoothers defined by the property Sm{axt + byt } = aSm{xt } + bSm{yt } • Semi-linear smoothers defined by the property Sm{ayt + b} = aSm{yt } + b Examples of linear smoothers include moving averages e.g. Sm{yt } = yt−1 + yt + yt+1 3 and weighted moving averages such as Hanning defined by 1 1 1 Sm{yt } = yt−1 + yt + yt+1 4 2 4 (Special adjustments are made at the ends of the series. 1.11. SMOOTHING 35 Examples of semi-linear smoothers include running medians of length 3 or 5 when smoothing without a computer or even lengths if using a statistical package with the right programs. e.g. Sm{yt } = med{yt−1 , yt , yt+1 } is a smoother of running medians of length 3 with the ends replicated (copied). These kinds of smoothers are applied several times until they “settle down”. Then end adjustments are made. The two basic types of smoothers are usually combined to form compound smoothers. The nomenclature for these smoothers is rather bewildering at first but informative: e.g. 3RSSH,twice refers to the smoother which • takes running medians of length 3 until the series stabilizes (R) • the S refers to splitting the repeated values, using the endpoint operator on them and then replaces the original smooth with these values • H applies the Hanning smoother to the series which remains • twice refers to using the smoother on the rough and then adding the rough back to the smooth to form the final smoothed version A little trial and error is needed in using these smoothers. Velleman has recommended the smoother 4253H,twice for general use. 36 1.11.1 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS Smoothing Example To illustrate the smoothing techniques we use data on unemployment percent for the years 1960 to 1990. . infile year unempl using c:\courses\b651201\datasets\unemploy.raw (31 observations read) . smooth 3 unempl, gen(sm1) . smooth 3 sm1, gen(sm2) . smooth 3R unempl, gen(sm3) . smooth 3RE unempl, gen(sm4) . smooth 4253H,twice unempl, gen(sm5) . gen sm5r=round(sm5,.1) 1.11. SMOOTHING 37 . 
list year unempl sm1 sm2 sm3 sm4 year 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 unempl 4.9 6 4.9 5 4.6 4.1 3.3 3.4 3.2 3.1 4.4 5.4 5 4.3 5 7.8 7 6.2 5.2 5.1 6.3 6.7 8.6 8.4 6.5 6.2 6 5.3 4.7 4.5 4.1 sm1 4.9 4.9 5 4.9 4.6 4.1 3.4 3.3 3.2 3.2 4.4 5 5 5 5 7 7 6.2 5.2 5.2 6.3 6.7 8.4 8.4 6.5 6.2 6 5.3 4.7 4.5 4.1 sm2 4.9 4.9 4.9 4.9 4.6 4.1 3.4 3.3 3.2 3.2 4.4 5 5 5 5 7 7 6.2 5.2 5.2 6.3 6.7 8.4 8.4 6.5 6.2 6 5.3 4.7 4.5 4.1 sm3 4.9 4.9 4.9 4.9 4.6 4.1 3.4 3.3 3.2 3.2 4.4 5 5 5 5 7 7 6.2 5.2 5.2 6.3 6.7 8.4 8.4 6.5 6.2 6 5.3 4.7 4.5 4.1 sm4 4.9 4.9 4.9 4.9 4.6 4.1 3.4 3.3 3.2 3.2 4.4 5 5 5 5 7 7 6.2 5.2 5.2 6.3 6.7 8.4 8.4 6.5 6.2 6 5.3 4.7 4.5 4.1 38 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS . list year unempl sm5r year 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 unempl 4.9 6 4.9 5 4.6 4.1 3.3 3.4 3.2 3.1 4.4 5.4 5 4.3 5 7.8 7 6.2 5.2 5.1 6.3 6.7 8.6 8.4 6.5 6.2 6 5.3 4.7 4.5 4.1 sm5r 4.9 5 5 4.9 4.6 4 3.6 3.4 3.4 3.6 4.1 4.6 4.8 5.1 5.5 6 6.2 6.1 5.9 5.8 6.2 7 7.4 7.3 7 6.4 5.8 5.3 4.8 4.4 4.1 1.11. SMOOTHING 39 . graph unempl sm4 year,s(oi) c(ll) ti(Unemployment and 3RE Smooth) xlab . graph unempl sm5r year,s(oi) c(ll) ti(Unemployment and 4253H,twice > lab Smooth) x . log close The graphs on the following two pages show the smoothed versions and the original data. 40 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS Graph of Unemployment Data and 3RE smooth Figure 1.9: 1.11. SMOOTHING 41 Graph of Unemployment Data and 4253H,twice Smooth. Figure 1.10: 42 1.12 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS Shapes of Batches Figure 1.11: 1.13. REFERENCES 1.13 43 References 1. Bound, J. A. and A. S. C. Ehrenberg (1989). Significant Sameness. J. R. Statis. Soc. A 152(Part 2): pp. 241-247. 2. Chakrapani, C. Numeracy. Encyclopedia of Statistics. 3. Chambers, J. M., W. S. Cleveland, et al. (1983). Graphical Methods for Data Analysis, Wadsworth International Group. 4. Chatfield, C. (1985). The Initial Examination of Data. J.R.Statist. Soc. A 148(3): 214-253. 5. Cleveland, W. S. and R. McGill (1984). The Many Faces of a Scatterplot. JASA 79(388): 807-822. 6. Doksum, K. A. (1977). Some Graphical Methods in Statistics. Statistica Neelandica Vol. 31(No. 2): pp. 53-68. 7. Draper, D., J. S. Hodges, et al. (1993). Exchangability and Data Analysis. J. R. Statist. Soc. A 156(Part 1): pp. 9-37. 8. Ehrenberg, A. S. C. (1977). Graphs or Tables ? The Statistician Vol. 27(No.2): pp. 87-96. 9. Ehrenberg, A. S. C. (1986). Reading a Table: An Example. Applied Statistics 35(3): 237-244. 10. Ehrenberg, A. S. C. (1977). Rudiments of Numeracy. J. R. Statis. Soc. A 140(3): 277-297. 11. Ehrenberg, A. S. C. Reduction of Data. Johnson and Kotz. 12. Ehrenberg, A. S. C. (1981). The Problem of Numeracy. American Statistician 35(3): 67-71. 13. Finlayson, H. C. The Place of ln x Among the Powers of x. American Mathematical Monthly: 450. 14. Gan, F. F., K. J. Koehler, et al. (1991). Probability Plots and Distribution Curves for Assessing the Fit of Probability Models. American Statistician 45(1): 14-21. 44 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS 15. Goldberg, K. and B. Iglewicz (1992). Bivariate Extensions of the Boxplot. Technometrics 34(3): 307-320. 16. Hand, D. J. (1996). Statistics and the Theory of Measurement. J. R. Statist. Soc. A 159(Part 3): pp. 445-492. 17. Hand, D. J. (1998). 
Data Mining: Statistics and More? American Statistics 52(2): 112-118. 18. Hoaglin, D. C., F. Mosteller, et al. (1991). Fundamentals of Exploratory Analysis of Variance, John Wiley & Sons, Inc. 19. Hoaglin, D. C., F. Mosteller, et al., Eds. (1983). Understanding Robust and Exploratory Data Analysis, John Wiley & Sons, Inc. 20. Hunter, J. S. (1988). The Digidot Plot. American Statistician 42(1): 54. 21. Hunter, J. S. (1980). The National System of Scientific Measurement. Science 210: 869-874. 22. Kafadar, K. Notched Box-and-Whisker Plots. Encyclopedia of Statistics. Johnson and Kotz. 23. Kruskal, W. (1978). Taking Data Seriously. Toward a Metric of Science, John Wiley & Sons: 139-169. 24. Mallows, C. L. and D. Pregibon (1988). Some Principles of Data Analysis, Statistical Research Reports No. 54 AT&T Bell Labs. 25. McGill, R., J. W. Tukey, et al. (1978). Variations of Box Plots. American Statistician 32(1): 12-16. 26. Mosteller, F. (1977). Assessing Unknown Numbers: Order of Magnitude Estimation. Statistical Methods for Policy Analysis. W. B. Fairley and F. Mosteller, AddisonWesley. 27. Paulos, J. A. (1988). Innumeracy: Mathematical Illiteracy and Its Consequences, Hill and Wang. 28. Paulos, J. A. (1991). Beyond Numeracy: Ruminations of a Numbers Man, Alfred A. Knopf. 1.13. REFERENCES 45 29. Preece, D. A. (1987). The language of size, quantity and comparison. The Statistician 36: 45-54. 30. Rosenbaum, P. R. (1989). Exploratory Plots for Paired Data. American Statistician 43(2): 108-109. 31. Scott, D. W. (1979). On optimal and data-based histograms. Biometrika 66(3): pp. 605-610. 32. Scott, D. W. (1985). Frequency Polygons: Theory and Applications. JASA 80(390): 348-354. 33. Sievers, G. L. Probability Plotting. Encyclopedia of Statistics. Johnson and Kotz: 232-237. 34. Snee, R. D. and C. G. Pfeifer.. Graphical Representation of Data. Encyclopedia of Statistics. Johnson and Kotz: 488-511. 35. Stevens, S. S. (1968). Measurement, Statistics and the Schemapric View. Science 161(3844): 849-856. 36. Stirling, W. D. (1982). Enhancements to Aid Interpretation of Probablity Plots. The Statistician 31(3): 211. 37. Sturges, H. A. (1926). The Choice of Class Interval. JASA 21: 65-66. 38. Terrell, G. R. and D. W. Scott (1985). Oversmoothed Nonparametric Density Estimates. JASA 80(389): 209-213. 39. Tukey, J. W. (1980). We Need Both Exploratory and Confirmatory. American Statistician 34(1): 23-25. 40. Tukey, J. W. (1986). Sunset Salvo. American Statistician 40(1): 72-76. 41. Tukey, J. W. (1977). Exploratory Data Analysis, Addison Wesley. 42. Tukey, J. W. and C. L. Mallows An Overview of Techniques of Data Analysis, Emphasizing Its Exploratory Aspects: 111-172. 43. Velleman, P. F. Applied Nonlinear Smoothing. Sociological Methodology 1982 San Francisco: Jossey-Bass 46 CHAPTER 1. NUMERACY AND EXPLORATORY DATA ANALYSIS 44. Velleman, P. F. and L. Wilkinson (1993). Nominal, Ordinal, Interval, and Ratio Typologies Are Misleading. American Statistician 47(1): 65-72. 45. Wainer, H. (1997). Improving Tabular Displays, With NAEP Tables as Examples and Inspirations. Journal of Educational and Behavioral Statistics 22(1): 1-30. 46. Wand, M. P. (1997). Data-Based Choice of Histogram Bin Width. American Statistician Vol. 51(No. 1): pp. 59-64. 47. Wilk, M. B. and R. Gnanadesikian (1968). Probability plotting methods for the analysis of data. Biometrika 55(1): 1-17. Chapter 2 Probability 2.1 2.1.1 Mathematical Preliminaries Sets To study statistics effectively we need to learn some probability. 
There are certain elementary mathematical concepts which we use to increase the precision of our discussions. The use of set notation provides a convenient and useful way to be precise about populations and samples. Definition: A set is a collection of objects called points or elements. Examples of sets include: • set of all individuals in this class • set of all individuals in Baltimore • set of integers including 0 i.e. {0, 1, . . .} • set of all non-negative numbers i.e. [0, +∞) • set of all real numbers i.e. (−∞, +∞) 47 48 CHAPTER 2. PROBABILITY To describe the contents of a set we will follow one of two conventions: • Convention 1: Write down all of the elements in the set and enclose them in curly brackets. Thus the set consisting of the four numbers 1, 2, 3 and 4 is written as {1, 2, 3, 4} • Convention 2: Write down a rule which determines or defines which elements are in the set and enclose the result in curly brackets. Thus the set consisting of the four numbers 1, 2, 3 and 4 is written as {x : x = 1, 2, 3, 4} and is read as “the set of all x such that x = 1, 2, 3, or 4”. The general convention is thus {x : C(x)} and is read as “the set of all x such that the condition C(x) is satisfied”. Obviously convention 2 is more useful for complicated and large sets. 2.1. MATHEMATICAL PRELIMINARIES 49 Notation and Definitions: • x ∈ A means that the point x is a point in the set A • x 6∈ A means that the point x is not a point in the set A Thus 1 ∈ {1, 2, 3, 4} but 5 6∈ {1, 2, 3, 4} • A ⊂ B means that each a ∈ A implies that a ∈ B. A. Such an A is said to be a subset of B. Thus {1, 2} ⊂ {1, 2, 3, 4} • A = B means that every point in A is also in B and conversely. More precisely A = B means that A ⊂ B and B ⊂ A. • The union of two sets A and B is denoted by A ∪ B and is the set of all points x which are in at least one of the sets. Thus if A = {1, 2} and B = {2, 3, 4} then A ∪ B = {1, 2, 3, 4} • The intersection of two sets A and B is denoted by A∩B and is the set of all points x which are in both of the sets. Thus if A = {1, 2} and B = {2, 3, 4} then A ∩ B = {2}. • If there are no points x which are in both A and B we say that A and B are disjoint or mutually exclusive and we write A∩B =∅ where ∅ is called the empty set (the set containing no points). 50 CHAPTER 2. PROBABILITY • Each set under discussion is usually considered to be a subset of a larger set Ω called the sample space. • The complement of a set A, Ac is the set of all points not in A i.e. Ac = {x : x 6∈ A} Thus if Ω = {1, 2, 3, 4, 5} and A = {1, 2, 4} then Ac = {3, 5}. • If B ⊂ A then A − B = A ∩ B c = {x : x ∈ A ∩ B c } • If a and b are elements or points we call (a, b) an ordered pair. a is called the first coordinate and b is called the second coordinate. Two ordered pairs are equal defined to be equal if and only if both their first and second coordinates are equal. Thus (a, b) = (c, d) if and only if a = c and b = d Thus if we record for an individual their blood pressure and their age the result may be written as (age, blood pressure). • The Cartesian product of two sets A and B is written as A × B and is the set of all ordered pairs having as first coordinate an element of A and second coordinate an element of B. More precisely A × B = {(a, b) : a ∈ A; b ∈ B} Thus if A = {1, 2, 3} and B = {3, 4} then A × B = {(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4)} • Extension of Cartesian products to three or more sets is useful. Thus A1 × A2 × A3 = {(a1 , a2 , a3 ) : a1 ∈ A1 , a2 ∈ A2 , a3 ∈ A3 } defines a set of triples. 
Two triples are equal if and only if they are equal coordinatewise. Most computer based storage systems (data base programs) implicitly use Cartesian products to label and store data values. • An n tuple is an ordered collection of n elements of the form a1 , a2 , . . . , an . 2.1. MATHEMATICAL PRELIMINARIES 51 example: Consider the set (population) of all individuals in the United States. If • A is all those who carry the AIDS virus • B is all homosexuals • C is all IV drug users Then • The set of all individuals who carry the AIDS virus and satisfy only one of the other two conditions is (A ∩ B ∩ C c ) ∪ (A ∩ B c ∩ C) • The set of all individuals satisfying at least two of the conditions is (A ∩ B) ∪ (A ∩ C) ∪ (B ∩ C) • The set of individuals satisfying exactly two of the conditions is (A ∩ B ∩ C c ) ∪ (A ∩ B c ∩ C) ∪ (Ac ∩ B ∩ C) • The set of all individuals satisfying all three conditions is A∩B∩C • The set of all individuals satisfying at least one of the conditions is A∪B∪C 52 CHAPTER 2. PROBABILITY 2.1.2 Counting Many probability problems involve “counting the number of ways” something can occur. Basic Principle of Counting: Given two sets A and B with n1 and n2 elements respectively of the form A = {a1 , a2 , . . . , an1 } B = {b1 , b2 , . . . , bn2 } then the set A × B consisting of all ordered pairs of the form (ai , bj ) contains n1 n2 elements. • To see this consider the table a1 a2 .. . b1 (a1 , b1 ) (a2 , b1 ) .. . b2 a1 , b2 ) a2 , b2 ) .. . ··· ··· ··· ... bn2 (a1 , bn2 ) (a2 , bn2 ) .. . an1 (an1 , b1 ) an1 , b2 ) · · · (an1 , bn2 ) The conclusion is thus obvious. • Equivalently: If there are n1 ways to perform operation 1 and n2 ways to perform operation 2 then there are n1 n2 ways to perform first operation 1 and then operation 2. • In general if there are r operations in which the ith operation can be performed in ni ways then there are n1 n2 · · · nr ways to perform the r operations in sequence. • Permutations: If a set S contains n elements, there are n! = n × (n − 1) × · · · × 3 × 2 × 1 different n tuples which can be formed from the n elements of S. – By convention 0! = 1. – If r ≤ n there are (n)r = (n − r + 1)(n − r + 2) · · · (n − 1)n r tuples composed of elements of S. 2.1. MATHEMATICAL PRELIMINARIES 53 • Combinations: If a set S contains n elements and r ≤ n, there are à ! Crn = n n! = r r!(n − r)! subsets of size r containing elements of S. To see this we note that if we have a subset of size r from S there are r! permutations of its elements, each of which is an r tuple of elements from S. Therefore we have the equation r! Crn = (n)r and the conclusion follows. examples: (1) For an ordinary deck of 52 cards there are 52 × 51 × 50 ways to choose a “hand” of three cards. (2) If we toss two dies (each six-sided with sides numbered 1-6) there are 36 possible outcomes. (3) The use of the convention that 0! = 1 can be considered a special case of the Gamma function defined by Z ∞ Γ(α) = xα−1 e−x dx 0 defined for any positive α. We note by integration by parts that Γ(α) = (α − 1)x ¯∞ Z ∞ ¯ ¯ + (α − 1) xα−2 e−x dx = (α − 1)Γ(α − 1) ¯ 0 α−1 ¯ 0 It follows that if α = n where n is an integer then Γ(n) = (n − 1)! and hence with n = 1 0! = Γ(1) = Z ∞ 0 e−x dx = 1 54 CHAPTER 2. PROBABILITY 2.2 Relating Probability to Responses and Populations Probability is a measure of the uncertainty associated with the occurrence of events. • In applications to statistics probability is used to model the uncertainty associated with the response of a study. 
• Using probability models and observed responses (data) we make statements (statistical inferences) about the study: ◦ The probability model allows us to relate the uncertainty associated with sample results to statements about population characteristics. ◦ Without such models we can say little about the population and virtually nothing about the reliability or generalizability of our results. • The term experiment or statistical experiment or random experiment denotes the performance of an observational study, a census or sample survey or a designed experiment. ◦ The collection, Ω, of all possible results of an experiment will be called the sample space. ◦ A particular result of an experiment will be called an elementary event and denoted by ω. ◦ An event is a collection of elementary events. ◦ Events are thus sets of elementary events. 2.2. RELATING PROBABILITY TO RESPONSES AND POPULATIONS • Notation and interpretations: ◦ ω ∈ E means that E occurs when ω occurs ◦ ω 6∈ E means that E does not occur when ω occurs ◦ E ⊂ F means that the occurrence of E implies the occurrence of F ◦ E ∩ F means the event that both E and F occur ◦ E ∪ F means the event that at least one of E or F occur ◦ φ denotes the impossible event ◦ E ∩ F = φ means that E and F are mutually exclusive ◦ E c is the event that E does not occur ◦ Ω is the sample space 55 56 CHAPTER 2. PROBABILITY 2.3 Probability and Odds - Basic Definitions 2.3.1 Probability Definition: Probability is an assignment to each event of a number called its probability such that the following three conditions are satisfied: (1) P (Ω) = 1 i.e. the probability assigned to the certain event or sample space is 1 (2) 0 ≤ P (E) ≤ 1 for any event E i.e. the probability assigned to any event must be between 0 and 1 (3) If E1 and E2 are mutually exclusive then P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) i.e. the probability assigned to the union of mutually exclusive events equals the sum of the probabilities assigned to the individual events. P (E) is called the probability of the event E Note: In considering probabilities for continuous responses we need a stronger form of (3): P (∪i Ei ) = X P (Ei ) i for any countable collection of events which are mutually exclusive. 2.3. PROBABILITY AND ODDS - BASIC DEFINITIONS 2.3.2 57 Properties of Probability Important properties of probabilities are: • P (E c ) = 1 − P (E) • P (∅) = 0 • E1 ⊂ E2 implies P (E1 ) ≤ P (E2 ) • P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) − P (E1 ∩ E2 ) Rather than develop the theory of probability we will: • Develop the most important probability models used in statistics. • Learn to use these models to make calculations according to the definitions and properties listed above • Learn how to interpret probabilities. examples: • Suppose that P (A) = .4, P (B) = .3 and P (A ∩ B) = .2 then P (A ∪ B) = .4 + .3 − .2 = .5 • For any three events A, B and C we have P (A∪B ∪C) = P (A)+P (B)+P (C)−P (A∩B)−P (A∩C)−P (B ∩C)+P (A∩B ∩C) and hence P (A ∪ B ∪ C) ≤ P (A) + P (B) + P (C) 58 CHAPTER 2. PROBABILITY 2.3.3 Methods for Obtaining Probability Models The four most important sample spaces for statistical applications are ◦ {0, 1, 2, . . . , n} (discrete-finite) ◦ {0, 1, 2, . . .} (discrete-countable) ◦ [0, ∞) (continuous) ◦ {(−∞, ∞)} (continuous) For these sample spaces probabilities are defined by probability mass functions (discrete case) and probability density functions (continuous case). We shall call both of these probability density functions (pdfs). 
◦ For the discrete cases a pdf assigns a number f (x) to each x in the sample space such that X f (x) ≥ 0 and f (x) = 1 x Then P (E) is defined by P (E) = X f (x) x∈E ◦ For the continuous cases a pdf assigns a number f (x) to each x in the sample space such that Z f (x) ≥ 0 and f (x)dx = 1 x Then P (E) is defined by Z P (E) = x∈E f (x)dx Since sums and integrals over disjoint sets are additive probabilities can be assigned using pdfs (i.e. the probabilities so assigned obey the three axioms of probabilities). examples: ◦ If à ! f (x) = n x p (1 − p)n−x x = 0, 1, 2, . . . , n x 2.3. PROBABILITY AND ODDS - BASIC DEFINITIONS 59 where 0 ≤ p ≤ 1 we have a binomial probabilty model with parameter p. The fact that X f (x) = x n X à ! n x p (1 − p)n−x = 1 x x=0 follows from the fact (Newton’s binomial expansion) that n X (a + b) = à ! n x n−x a b x x=0 for any a and b. ◦ If λx e−λ x = 0, 1, 2, . . . x! where λ ≥ 0 we have a Poisson probability model with parameter λ. The fact that f (x) = X f (x) = x ∞ X λx e−λ x=0 follows from the fact that ∞ X λx x=0 x! x! =1 = eλ ◦ If f (x) = λe−λx 0 ≤ x < ∞) where λ ≥ 0 we have an exponential probability model with parameter λ. The fact that Z Z ∞ f (x)dx = λe−λx dx = 1 x follows from the fact that 0 Z ∞ 0 e−λx dx = 1 λ 60 CHAPTER 2. PROBABILITY ◦ If f (x) = (2πσ)−1/2 exp{−(x − µ)2 /2σ 2 } − ∞ < x < +∞ where −∞ < µ < +∞ and σ > 0 we have a normal or Gaussian probability model with parameters µ and σ 2 . The fact that Z x f (x)dx = Z +∞ −∞ (2πσ)−1/2 exp{−(x − µ)2 /2σ 2 }dx = 1 is shown in the supplemental notes. Each of the above examples of probability models play major roles in the statistical analysis of data from experimental studies. The binomial is used to model prospective (cohort), retrospective (case-control) studies in epeidemiology, the Poisson is used to model accident data, the exponential is used to model failure time data and the normal distribution is used for measurement data which has a bell-shaped distribution as well as to approximate the binomial and Poisson. The normal distribution also figures in the calculation of many common statistics used for inference via the Central Limit Theorem. All of these models are special cases of the exponential family of distributions defined as having pdfs of the form: p X f (x; θ1 , θ2 , . . . , θp ) = C(θ1 , θ2 , . . . , θp )h(x) exp j=1 tj (x)qj (θ) 2.3. PROBABILITY AND ODDS - BASIC DEFINITIONS 2.3.4 61 Odds Closely related to probabilities are odds. • If the odds of an event E occurring are given as a to b this means, by definition, that P (E) P (E) a = = c P (E ) 1 − P (E) b We can solve for P (E) to obtain P (E) = a a+b ◦ Thus we can go from odds to probabilities and vice-versa. ◦ Thinking about probabilities in terms of odds sometimes provides useful interpretation of probability statements. • Odds can also be given as the odds against E are c to d. This means that P (E c ) 1 − P (E) c = = P (E) P (E) d so that in this case P (E) = d c+d • example: The odds against disease 1 are 9 to 1. Thus P (disease 1) = 1 = .1 1+9 • example: The odds of thundershowers this afternoon are 2 to 3. Thus P (thundershowers) = 2 = .4 2+3 62 CHAPTER 2. PROBABILITY • Ratios of odds are called odds ratios and play an important role in modern epidemiology where they are used to quantify the risk associated with exposure. ◦ example: Let OR be the odds ratio for the occurrence of a disease in an exposed population relative to an unexposed or control population. 
Thus odds of disease in exposed population OR = = odds of disease in control population p2 1−p2 p1 1−p1 where p2 is the probability of the disease in the exposed population and p1 is the probability of the disease in the control population. ◦ Note that if OR = 1 then p2 p1 = 1 − p2 1 − p1 which implies that p2 = p1 i.e. that the probability of disease is the same in the exposed and control population. ◦ If OR > 1 then p1 p2 > 1 − p2 1 − p1 which can be shown to imply that p2 > p1 i.e. that the probability of disease in the exposed population exceeds the probability of the disease in the control population. ◦ If OR < 1 the reverse conclusion holds i.e. the probability of disease in the control population exceeds the probability of disease in the exposed population. 2.3. PROBABILITY AND ODDS - BASIC DEFINITIONS 63 • The odds ratio, while useful in comparing the relative magnitude of risk of disease does not convey the absolute magnitude of the risk (unless the risk is small). ◦ Note that p2 1−p2 p1 1−p1 implies that = OR " p1 p2 = OR 1 + (OR − 1)p1 # ◦ Consider a situation in which the odds ratio is 100 for exposed vs control. Thus if OR = 100 and p1 = 10−6 (one in a million) then p2 is approximately 10−4 (one in ten thousand). If p1 = 10−2 (one in a hundred) then p2 = 100 1 100³ 1 + 99 1 100 ´ = 100 = .50 199 64 2.4 CHAPTER 2. PROBABILITY Interpretations of Probability Philosophers have discussed for several centuries at various levels what constitues “probability”. For our purposes probability has three useful operational interpretations. 2.4.1 Equally Likely Interpretation Consider an experiment where the sample space consists of a finite number of elementary events e1 , e2 , . . . , eN If, before the experiment is performed, we consider each of the elementary events to be “equally likely” or exchangeable then an assignment of probability is given by p({ei }) = 1 N This allows an interpretation of statements such as “we selected an individual at random from a population” since in ordinary language at random means that each invidual has the same chance of being selected. Although defining probability via this recipe is circular it is a useful interpretation in any situation where the sample space is finite and the elementary events are deemed equally likely. It forms the basis of much of sample survey theory where we select individuals at random from a population in order to investigate properties of the population. Summary: The equally likely interpretation assumes that each element in the sample space has the same chance of occuring. 2.4. INTERPRETATIONS OF PROBABILITY 2.4.2 65 Relative Frequency Interpretation Another interpretation of probability is the so called relative frequency interpretation. • Imagine a long series of trials in which the event of interest either occurs or does not occur. • The relative frequency (number of trials in which the event occurs divided by the total number of trials) of the event in this long series of trials is taken to be the probability of the event. • This interpretation of probability is the most widely used interpretation in scientific studies. Note, however, that it is also circular. • It is often called the “long run frequency interpretation”. 2.4.3 Subjective Probability Interpretation This interpretation of probability requires the personal evaluation of probabilities using indifference between two wagers (bets). Suppose that you are interested in determining the probability of an event E. 
Consider two wagers defined as follows:

Wager 1: You receive $100 if the event E occurs and nothing if it does not occur.

Wager 2: There is a jar containing x white balls and N − x red balls. A ball is drawn at random; you receive $100 if it is white and nothing otherwise.

You are required to make one of the two wagers. Your probability of E is taken to be the ratio x/N at which you are indifferent between the two wagers.

2.4.4 Does it Matter?

• For most applications of probability in modern statistics the specific interpretation of probability does not matter all that much.

• What matters is that probabilities have the properties given in the definition and those properties derived from them.

• In this course we will take probability as a primitive concept, leaving it to philosophers to argue the merits of particular interpretations.

• Each of the interpretations discussed above satisfies the three basic axioms of the definition of probability.

2.5 Conditional Probability

• Conditional probabilities possess all the properties of probabilities.

• Conditional probabilities provide a method to revise probabilities in the light of additional information (the process itself is called conditioning).

• Conditional probabilities are important because almost all probabilities are conditional probabilities.

example: Suppose a coin is flipped twice and you are told that at least one coin is a head. What is the chance or probability that they are both heads? Assuming a fair coin and a good toss, each of the four possibilities {(H, H), (H, T), (T, H), (T, T)}, which constitute the sample space for this experiment, has the same probability, i.e. 1/4. Since the information given rules out (T, T), a logical answer for the conditional probability of two heads given at least one head is 1/3.

example: A family has three children. What is the probability that two of the children are boys? Assuming that gender distributions are equally likely, the eight equally likely possibilities are:

{(B, B, B), (B, B, G), (B, G, B), (G, B, B), (G, G, B), (G, B, G), (B, G, G), (G, G, G)}

Thus the probability of exactly two boys is

1/8 + 1/8 + 1/8 = 3/8

Depending on the conditioning information the probability of two boys is modified, e.g.

• What is the probability of two boys if you are told that at least one child in the family is a boy? Answer: 3/7

• What is the probability of two boys if you are told that at least one child in the family is a girl? Answer: 3/7

• What is the probability of two boys if you are told that the oldest child is a boy? Answer: 1/2

• What is the probability of two boys if you are told that the oldest child is a girl? Answer: 1/4

We generalize to other situations using the following definition:

Definition: The conditional probability of event B given event A is

P(B|A) = P(B ∩ A) / P(A)

provided that P(A) > 0.

example: The probability of two boys given that the oldest child is a boy is the probability of the event "two boys in the family and the oldest child in the family is a boy" divided by the probability of the event "the oldest child in the family is a boy". Thus the required conditional probability is given by

P({(B, G, B), (G, B, B)}) / P({(B, B, B), (B, G, B), (G, B, B), (G, G, B)}) = (2/8) / (4/8) = 1/2
CONDITIONAL PROBABILITY 2.5.1 69 Multiplication Rule The multiplication rule for probabilities is as follows: P (A ∩ B) = P (A)P (B|A) which can immediately be extended to P (A ∩ B ∩ C) = P (A)P (B|A)P (C|A ∩ B) and in general to: P (E1 ∩ E2 ∩ · · · ∩ En ) = P (E1 )P (E2 |E1 ) · · · P (En |E1 ∩ E2 ∩ · · · ∩ En−1 ) example: There are n people in a room. What is the probability that at least two of the people have a common birthday? Solution: We first note that P (common birthday) = 1 − P (no common birthday) If there are just two people in the room then µ 365 365 ¶µ 364 365 ¶ 365 P (no common birthday) = 365 ¶µ 364 365 ¶µ 363 365 P (no common birthday) = while for three people we have µ ¶ It follows that the probability of no common birthday with n people in the room is given by µ 365 365 ¶µ ¶ à 364 365 − (n − 1) ··· 365 365 ! 70 CHAPTER 2. PROBABILITY Simple calculations show that if n = 23 then the probability of no common birthday is slightly less than 12 . Thus if the number of people in a room is 23 or larger the probability of a common birthday exceeds 12 . The following is a short table of the results for other values of n n 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Prob .003 .008 .016 .027 .041 .056 .074 .095 .117 .141 .167 .194 .223 .253 .284 n 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 Prob .315 .347 .379 .411 .444 .476 .507 .538 .569 .598 .627 .654 .681 .706 .730 2.5. CONDITIONAL PROBABILITY 2.5.2 71 Law of Total Probability Law of Total Probability: For any event E we have P (E) = X P (E|Ei )P (Ei ) i where Ei is a partition of the sample space i.e. the Ei are mutually exclusive and their union is the sample space. example: An examination consists of multiple choice questions. Each question is a multiple choice question in which there are 5 alternative answers only one of which is correct. If a student has diligently done his or her homework he or she is certain to select the correct answer. If not he or she has only a one in five chance of selecting the correct answer (i.e. they choose an answer at random). Let • p be the probability that the student does their homework • A the event that they do their homework • B the event that they select the correct answer 72 CHAPTER 2. PROBABILITY (i) What is the probability that the student selects the correct answer to a question? Solution: We are given P (A) = p ; P (B|A) = 1 and P (B|Ac ) = 1 5 By the Law of Total Probability P (B) = P (A)P (B|A) + P (Ac )P (B|Ac ) µ ¶ 1 = p × 1 + (1 − p) × 5 5p + 1 − p = 5 4p + 1 = 5 (ii) What is the probability that the student did his or her homework given that they selected the correct anwer to the question? Solution: In this case we want P (A|B) so that P (A ∩ B) P (B) P (A)P (B|A) = P (B) 1×p = 4p+1 P (A|B) = 5 5p = 4p + 1 2.5. CONDITIONAL PROBABILITY 73 example: Cross-Sectional Study Suppose a population of individuals is classified into four categories defined by • their disease status (D is diseased and Dc is not diseased) • their exposure status (E is exposed and E c is not exposed). If we observe a sample of n individuals so classified we have the following population probabilities and observed data. Population Probabilities Dc D c c c E P (E , D ) P (E c , D) E P (E, Dc ) P (E, D) Total P (Dc ) P (D) Total P (E c ) P (E) 1 Sample Numbers Dc D c c n(E , D ) n(E c , D) n(E, Dc ) n(E, D) n(Dc ) n(D) The law of total probability then states that P (D) = P (E, D) + P (E c , D) = P (D|E)P (E) + P (D|E c )P (E c ) Total n(E c ) n(E) n 74 CHAPTER 2. 
PROBABILITY Define the following quantities: Population Parameters prob of exposure P (E) = P (E, D) + P (E, Dc ) prob of disease given exposed P (D|E) = PP(E,D) (E) odds of disease if exposed (D,E) O(D|E) = PP(D c ,E) odds of disease if not exposed (D,E c ) O(D|E c ) = PP(D c ,E c ) odds ratio (relative odds) O(D|E) OR = O(D|E c) relative risk (D|E) RR = PP(D|E c) Sample Estimates prob of exposure c) p(E) = n(E,D)+n(E,D n prob of disease given exposed p(D|E) = n(D,E) n(E) odds of disease if exposed n(D,E) o(D|E) = n(D c ,E) odds of disease if not exposed n(D,E c ) o(D|E c ) = n(D c ,E c ) odds ratio (relative odds) o(D|E) or = o(D|E c) relative risk p(D|E) rr = p(D|E c) It can be shown that if the disease is rare in both the exposed group and the non exposed group then OR ≈ RR The above population parameters are fundamental to the epidemiological approach to the study of disease as it relates to exposure. example: In demography the crude death rate is defined as CDR = Total Deaths D = Population Size N If the population is divided into k age groups or other strata defined by gender, ethnicity, etc. then D = D1 + D2 + · · · + Dk and N = N1 + N2 + · · · + Nk and hence D CR = = N Pk i=1 N Di Pk = k Ni Mi X = pi Mi N i=1 i=1 where Mi = Di /Ni is the age specfic death rate for the ith age group and pi = Ni /N is the proportion of the population in the ith age group. This is directly analogous to the law of total probability, 2.6. BAYES THEOREM 2.6 75 Bayes Theorem Bayes theorem combines the definition of conditional probability, the multiplication rule and the law of total probability and asserts that P (Ei )P (E|Ei ) P (Ei |E) = P j P (Ej )P (E|Ej ) • where E is any event • the Ej constitute a partition of the sample space • Ei is any event in the partition. Since P (Ei ∩ E) P (E) P (Ei ∩ E) = P (Ei )P (E|Ei ) X P (E) = P (Ej )P (E|Ej ) P (Ei |E) = j Bayes theorm is obviously true. Note: A partition of the sample space is a collection of mutually exclusive events such that their union is the sample space. 76 CHAPTER 2. PROBABILITY example: The probability of disease given exposure is .5 while the probability of disease given non-exposure is .1. Suppose that 10% of the population is exposed. If a diseased individual is detected what is the probability that the individual was exposed? Solution: By Bayes theorem P (Ex)P (Dis|Ex) P (Ex)P (Dis|Ex) + P (N o Ex)P (Dis|N o Ex) (.1)(.5) = (.1)(.5) + (.9)(.1) 5 = 5+9 5 = 14 P (Ex|Dis) = The intuitive explanation for this result is as follows: • Given 1,000 individuals 100 will be exposed and 900 not exposed • Of the 100 individuals exposed 50 will have the disease. • of the 900 non exposed individuals 90 will have the disease Thus of the 140 individuals with the disease, 50 will have been exposed which yields a 5 proportion of 14 . 2.6. BAYES THEOREM 77 example: Diagnostic Tests In this type of study we are interested in the performance of a diagnostic test designed to determine whether a person has a disease. The test has two possible results: • + positive test (the test indicates presence of disease). • − negative test (the test does not indicate presence of disease). We thus have the following setup: Population Probabilities Dc D Total c − P (−, D ) P (−, D) P (−) + P (+, Dc ) P (+, D) P (+) Total P (Dc ) P (D) 1 Sample Numbers Dc D Total c n(−, D ) n(−, D) n(−) n(+, Dc ) n(+, D) n(+) n(Dc ) n(D) n 78 CHAPTER 2. 
PROBABILITY We define the following quantities: Population Parameters sensitivity P (+,D) P (+|D) = P (+,D)+P (−,D) specificity c) P (−|Dc ) = P (−,DP c(−,D )+P (+,Dc ) positive test probability P (+) = P (+, D) + P (+, Dc ) negative test probability P (−) = P (−, D) + P (−, Dc ) positive predictive value P (D|+) = PP(+,D) (+) negative predictive value c) P (Dc |−) = P (−,D P (−) Sample Estimates sensitivity n(+,D) p(+|D) = n(+,D)+n(−,D) specificity c) c p(−|D ) = n(−,Dn(−,D c )+n(+,D c ) proportion positive test p(+) = n(+) n proportion negative test p(−) = n(−) n positive predictive value p(D|+) = p(+,D) p(+) negative predictive value c) p(Dc |−) = p(−,D p(−) As an example consider the performance of a blood sugar diagnostic test to determine whether a person has diabetes. The test has two possible results: • + positive test (the test indicates presence of diabetes). • − negative test (the test does not indicate presence of diabetes). 2.6. BAYES THEOREM 79 The following numerical example is from Epidemiology (1996) Gordis, L. W. B. Saunders. We have the following setup: Population Probabilities Dc D Total c − P (−, D ) P (−, D) P (−) + P (+, Dc ) P (+, D) P (+) Total P (Dc ) P (D) 1 Sample Numbers Dc D Total 7600 150 7750 1900 350 2250 9500 500 10, 000 We calculate the following quantities: Population Parameters sensitivity P (+,D) P (+|D) = P (+,D)+P (−,D) specificity c) P (−|Dc ) = P (−,DP c(−,D )+P (+,Dc ) positive test probability P (+) = P (+, D) + P (+, Dc ) negative test probability P (−) = P (−, D) + P (−, Dc ) positive predictive value P (D|+) = PP(+,D) (+) negative predictive value c) P (Dc |−) = P (−,D P (−) Sample Estimates sensitivity = .70 p(+|D) = 350 500 specificity 7600 p(−|Dc ) = 9500 = .80 proportion positive test 2250 p(+) = 10,000 = .225 proportion negative test 7750 = .775 p(−) = 10,000 positive predictive value 350 p(D|+) = 2250 = 0.156 negative predictive value 7600 p(Dc |−) = 7750 = 0.98 80 CHAPTER 2. PROBABILITY 2.7 Independence Closely related to the concept of conditional probability is the concept of independence of events. Definition Events A and B are said to be independent if P (B|A) = P (B) Thus knowledge of the occurrence of A does not influence the assignment of probabilities to B. Since P (B|A) = P (A ∩ B) P (A) it follows that if A and B are independent then P (A ∩ B) = P (A)P (B) This last formulation of independence is the definition used in building probability models. 2.8. BERNOULLI TRIAL MODELS; THE BINOMIAL DISTRIBUTION 2.8 81 Bernoulli trial models; the binomial distribution • One of the most important probability models is the binomial. It is widely used in epidemiology and throughout statistics. • The binomial model is based on the assumption of Bernoulli trials. The assumptions for a Bernoulli trial model are (1) The result of the experiment or study can be thought of as the result of n smaller experiments called trials each of which has only two possible outcomes e.g. (dead, alive), (diseased, non-diseased), (success, failure) (2) The outcomes of the trials are independent (3) The probabilities of the outcomes of the trials remain the same from trial to trial (homogeneous probabilities). example 1: A group of n individuals are tested to see if they have elevated levels of cholestrol. Assuming the results are recorded as elevated or not elevated and we can justify (2) and (3) we may apply the Bernoulli trial model. example 2: A population of n individuals is found to have d deaths during a given period of time. 
Assuming we can justify (2) and (3), we may use the Bernoulli model to describe the results of the study.

In Bernoulli trial models the quantity of interest is the number of successes x which occur in the n trials. It can be shown that the following formula gives the probability of obtaining x successes in n Bernoulli trials:

P(x) = (n choose x) p^x (1 − p)^(n−x)

where

• x can be 0, 1, 2, . . . , n

• p is the probability of success on a given trial

• (n choose x), read as "n choose x", is defined by

(n choose x) = n! / [x! (n − x)!]

In this last formula r! = r(r − 1)(r − 2) · · · 3 · 2 · 1 for any positive integer r, and 0! = 1.

Note: The term distribution is used because the formula describes how to distribute probability over the possible values of x.

example: The chance or probability of having an elevated cholesterol level is 1/100. If 10 individuals are examined, what is the probability that one or more of them will have an elevated level?

Solution: The binomial model applies so that

P(0) = (10 choose 0) (.01)^0 (1 − .01)^(10−0) = (.99)^10

Thus

P(1 or more elevated) = 1 − P(0 elevated) = 1 − (.99)^10 ≈ .096

2.9 Parameters and Random Sampling

• The numbers n and p which appear in the formula for the binomial distribution are examples of what statisticians call parameters.

• Different values of n and p give different assignments of probabilities, each of the binomial type.

• Thus a parameter can be considered as a label which identifies the particular assignment of probabilities.

• In applications of the binomial distribution the parameter n is known and can be fixed by the investigator - it is thus a study design parameter.

• The parameter p, on the other hand, is unknown and obtaining information about it is the reason for performing the experiment. We use the observed data and the model to tell us something about p.

This same set-up applies in most applications of statistics. To summarize:

• Probability distributions relate observed data to parameters.

• Statistical methods use data and probability models to make statements about the parameters of interest.

In the case of the binomial the parameter of interest is p, the probability of success on a given trial.

example: Random sampling and the binomial distribution. In many circumstances we are given the results of a survey or study in which the investigators state that they examined a "random sample" from the population of interest. Suppose we have a population containing N individuals or objects. We are presented with a "random sample" consisting of n individuals from the population. What does this mean? We begin by defining what we mean by a sample.

Definition: A sample of size n from a target population T containing N objects is an ordered collection of n objects, each of which is an object in the target population. In set notation a sample is just an n-tuple with each coordinate being an element of the target population. In symbols, then, a sample s is

s = (a1, a2, . . . , an) where a1 ∈ T, a2 ∈ T, . . . , an ∈ T.

Specific example: If T = {a, b, c, d} then a possible sample of size 2 is (a, b), while some others are (b, a) and (c, d). What about (a, a)? Clearly, this is a sample according to the definition. To distinguish between these two types of samples:

• A sample is taken with replacement if an element in the population can appear more than once in the sample.

• A sample is taken without replacement if an element in the population can appear at most once in the sample.
PARAMETERS AND RANDOM SAMPLING 85 Thus in our example the possible samples of size 2 with replacement are (a, a) (b, a) (c, a) (d, a) (a, b) (b, b) (c, b) (d, b) (a, c) (b, c) (c, c) (d, c) (a, d) (b, d) (c, d) (d, d) while without replacement the possible samples are (a, b) (b, a) (b, c) (d, b) (a, c) (c, a) (c, b) (c, d) (a, d) (d, a) (b, d) (d, c) Definition: A random sample of size n from a population of size N is a sample which is selected such that each sample has the same chance of being selected i.e. P (sample selected) = 1 number of possible samples 1 Thus in the example each sample with replacement would be assigned a chance of 16 while 1 each sample without replacement would be assigned a chance of 12 for random sampling. 86 CHAPTER 2. PROBABILITY In the general case, • For sampling with replacement the probability assigned to each sample is 1 Nn • For sampling without replacement the probability assigned to each sample is 1 (N )n where (N )n is given by: (N )n = N (N − 1)(N − 2) · · · (N − n + 1) In our example we see that N n = 42 = 16 and (N )n = (4)2 = 4(4 − 2 + 1) = 4 × 3 = 12 To summarize: A random sample is the result of a selection process in which each sample has the same chance of being selected. 2.9. PARAMETERS AND RANDOM SAMPLING 87 Suppose now that each object in the population can be classified into one of two categories e.g. (exposed, not exposed), (success, failure), (A, not A), (0, 1) etc. For definiteness let us call the two outcomes success and failure and denote them by S and F . In the example suppose that a and b are successes while c and d are failures. The target population is now T = {a(S), b(S), c(F ), d(F )} In general D of the objects will be successes and N − D will be failures. The question of interest is: If we select a random sample of size n from a population of size N consisting of D successes and N − D failures, what is the probability that x successes will be observed in the sample? In the example we see that with replacement the samples are (a(S), a(S)) (b(S), a(S)) (c(F ), a(S)) (d(F ), a(S)) (a(S), b(S)) (b(S), b(S)) (c(F ), b(S)) (d(F ), b(S)) (a(S), c(F )) (b(S), c(F )) (c(F ), c(F )) (d(F ), c(F )) (a(S), d(F )) (b(S), d(F )) (c(F ), d(F )) (d(F ), d(F )) Thus if sampling is at random with replacement the probabilities of 0 successes, 1 success and 2 successes are given by 4 16 8 P (1) = 16 4 P (2) = 16 P (0) = If sampling is at random without replacement the probabilities are given by 2 12 8 P (1) = 12 2 P (2) = 12 P (0) = 88 CHAPTER 2. PROBABILITY These probabilities can, in the general case, be shown to be without replacement : à ! P (x successes) = with replacement : n (D)x (N − D)n−x x (N )n à !µ n P (x successes) = x D N ¶x µ D 1− N ¶n−x The distribution without replacement is called the hypergeometric distribution with parameters N, n and D. The distribution with replacement is the binomial distribution with parameters n and p = D/N . In many applications the sample size, n, is small relative to the population size N . In this situation it can be shown that the formula à !µ n x D N ¶x µ D 1− N ¶n−x provides an adequate approximation to the probabilities for sampling without replacement. Thus for most applications, random sampling from a population in which each individual is classified as a success or a failure results in a binomial distribution for the probability of obtaining x successes in the sample. 
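To make the with/without replacement comparison concrete, here is a short Python sketch (an addition, not part of the original notes) that evaluates both formulas with the standard library only; the large population size at the end is an illustrative choice. It reproduces the toy probabilities above (2/12, 8/12, 2/12 versus 4/16, 8/16, 4/16) and shows how close the two assignments become when n is small relative to N.

```python
from math import comb, perm  # perm(a, k) = a(a-1)...(a-k+1), written (a)_k in the notes

def without_replacement(x, N, D, n):
    """P(x successes) for random sampling without replacement (hypergeometric)."""
    return comb(n, x) * perm(D, x) * perm(N - D, n - x) / perm(N, n)

def with_replacement(x, N, D, n):
    """P(x successes) for random sampling with replacement (binomial, p = D/N)."""
    p = D / N
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Toy population from the notes: N = 4 objects, D = 2 successes, samples of size n = 2.
for x in range(3):
    print(x, without_replacement(x, 4, 2, 2), with_replacement(x, 4, 2, 2))
# gives 2/12, 8/12, 2/12 versus 4/16, 8/16, 4/16

# When n is small relative to N the two assignments are nearly identical:
N, D, n = 10_000, 3_000, 10
for x in range(n + 1):
    print(x, round(without_replacement(x, N, D, n), 5), round(with_replacement(x, N, D, n), 5))
```

The second loop is the numerical version of the approximation claim: with N = 10,000 and n = 10 the hypergeometric and binomial probabilities agree to three or four decimal places.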
The interpretation of the parameter p = D N is thus: • “the proportion of successes in the target population” • “the chance that an individual selected at random will be classified as a success”. 2.9. PARAMETERS AND RANDOM SAMPLING 89 example: Prospective (Cohort) Study In this type of study • we observe n(E) individuals who are exposed and n(E c ) individuals who are not exposed. • These individuals are followed and the number in each group who develop the disease are recorded. We thus have the following setup: Ec E Population Probabilities c D D P (Dc |E c ) P (D|E c ) P (Dc |E) P (D|E) Total 1 1 Sample Numbers c D D Total n(Dc , E c ) n(D, E c ) n(E c ) n(Dc , E) n(D, E) n(E) We can model this situation as two independent binomial distributions as follows: n(D, E) is binomial (n(E), P (D|E)) n(D, E c ) is binomial (n(E c ), P (D|E c )) 90 CHAPTER 2. PROBABILITY We define the following quantities: Population Parameters prob of disease given exposed P (D|E) = PP(E,D) (E) odds of disease if exposed (D,E) O(D|E) = PP(D c ,E) odds of disease if not exposed (D,E c ) O(D|E c ) = PP(D c ,E c ) odds ratio (relative odds) O(D|E) OR = O(D|E c) relative risk (D|E) RR = PP(D|E c) Sample Estimates prob of disease given exposed p(D|E) = n(D,E) n(E) odds of disease if exposed n(D,E) o(D|E) = n(D c ,E) odds of disease if not exposed n(D,E c ) o(D|E c ) = n(D c ,E c ) odds ratio (relative odds) o(D|E) or = o(D|E c) relative risk p(D|E) rr = p(D|E c) As an example consider the following hypothetical study in which we follow smokers and non smokers to see which individuals develop coronary heart disease (CHD). Thus E is smoker and E c is non smoker. This example is from Epidemiology (1996) Gordis, L. W. B. Saunders. 2.9. PARAMETERS AND RANDOM SAMPLING 91 We have the following setup: c E E Population Probabilities Dc D c c P (D |E ) P (D|E c ) P (Dc |E) P (D|E) Total 1 1 Sample Numbers No CHD CHD Total 4, 913 87 5, 000 2, 916 84 3, 000 We calculate the following quantities: Population Parameters prob of disease given exposed P (D|E) = PP(E,D) (E) odds of disease if exposed (D,E) O(D|E) = PP(D c ,E) odds of disease if not exposed (D,E c ) O(D|E c ) = PP(D c ,E c ) odds ratio (relative odds) O(D|E) OR = O(D|E c) relative risk (D|E) RR = PP(D|E c) Sample Estimates prob of disease given exposed 84 p(CHD|S) = 3,000 = 0.028 odds of disease if exposed 84 o(CHD|S) = 2916 = 0.0288 odds of disease if not exposed 87 o(CHD|N S) = 4913 = 0.0177 odds ratio (relative odds) or = 84/2916 = 1.63 87/4913 relative risk = 1.61 rr = 84/3000 87/5000 92 CHAPTER 2. PROBABILITY example: Retrospective (Case-Control) Study In this type of study we • Select n(D) individuals who have the disease (cases) and n(Dc ) individuals who do not have the disease (controls). • Then the number of individuals in each group who were exposed is determined. 
We thus have the following setup: Population Probabilities Dc D c c c E P (E |D ) P (E c |D) E P (E|Dc ) P (E|D) Total 1 1 Sample Numbers Dc D c c n(D , E ) n(D, E c ) n(Dc , E) n(D, E) n(Dc ) n(D) We can model this situation as two independent binomials as follows: n(D, E) is binomial (n(D), P (E|D)) n(Dc , E) is binomial (n(Dc ), P (E|Dc )) Define the following quantities: Population Parameters prob of exposed given diseased P (E|D) odds of exposed if disease (E|D) O(E|D) = PP(E c |D) odds of exposed if not disease (E|Dc ) O(E|Dc ) = PP(E c |D c ) odds ratio (relative odds) O(E|D) OR = O(E|D c) Sample Estimates prob of exposed given disease p(E|D) = n(D,E) n(D) odds of exposed if disease n(D,E) o(E|D) = n(D,E c) odds of exposed if not disease n(E,Dc ) o(E|Dc ) = n(E c ,D c ) odds ratio (relative odds) o(E|D) or = o(E|D c) 2.9. PARAMETERS AND RANDOM SAMPLING 93 As an example consider the following hypothetical study in which examine individuals with coronary heart disease (CHD) (cases) and individuals without coronary heart diease (controls). We then determine which individuals were smokers and which were not. Thus E is smoker and E c is non smoker. This example is from Epidemiology (1996) Gordis, L. W. B. Saunders. Ec E Total Population Probabilities Controls Cases c c P (E |D ) P (E c |D) P (E|Dc ) P (E|D) 1 1 Sample Numbers Controls Cases 224 88 176 112 400 200 We calculate the following quantities: Population Parameters prob of exposed given diseased P (E|D) odds of exposed if disease (E|D) O(E|D) = PP(E c |D) odds of exposed if not disease (E|Dc ) O(E|Dc ) = PP(E c |D c ) odds ratio (relative odds) O(E|D) OR = O(E|D c) Sample Estimates prob of exposed given disease 112 = 0.56 p(E|D) = 200 odds of exposed if disease o(E|D) = 112 = 1.27 88 odds of exposed if not disease 176 o(E|Dc ) = 224 = 0.79 odds ratio (relative odds) 112/88 or = 176/224 = 1.62 94 2.10 CHAPTER 2. PROBABILITY Probability Examples The following two examples illustrate the importance of probability in solving real problems. Each of the topics presented has been extended and generalized since their introduction. 2.10.1 Randomized Response Suppose that a sociologist is interested in determining the prevalence of child abuse in a population. Obviously if individual parents are asked a question such as “have you abused your child” the reliability of the answer is in doubt. The sociologist would ideally like the parent to respond with an honest choice between the following two questions: (i) Have you ever abused your children? (ii) Have you not abused your children? A clever method for determining prevalence in such a situation is to provide the respondent with a randomization device such as a deck of cards in which a proportion P of the cards are marked with the number 1 and the remainder with the number 2. The respondent selects a card at random and replaces it with the result unknown to the interviewer. Thus confidentiality of the respondent is protected. If the card drawn is 1 the respondent answers truthfully to question 1 whereas if the card drawn is a 2 the respondent answers truthfully to question 2. 2.10. PROBABILITY EXAMPLES 95 It follows that the probability λ that the respondent answers yes is given by λ = P (yes) = P (yes|Q1)P Q1) + P (yes|Q2)P Q2) = πP + (1 − π)(1 − P ) where π is the prevalence (the proportion in the population who abuse their children and P is the proportion of 1’s in the deck of cards. We assume P 6= 1/2. 
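The mechanics of this design are easy to check by simulation. The Python sketch below is not part of the original notes, and the prevalence π = 0.05, card proportion P = 0.7 and sample size are made-up values; it only illustrates that the proportion of yes answers tracks πP + (1 − π)(1 − P), and that knowing P lets us solve back for π.

```python
import random

def randomized_response(pi, P, n, seed=0):
    """Simulate n respondents: with probability P the card says answer
    question 1 truthfully, otherwise question 2; pi is the true prevalence."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        abuser = rng.random() < pi        # respondent's true status
        question1 = rng.random() < P      # card drawn is a 1 with probability P
        yes += abuser if question1 else (not abuser)
    return yes / n

pi, P, n = 0.05, 0.7, 100_000             # illustrative prevalence and deck composition
lam_hat = randomized_response(pi, P, n)
print("observed proportion of yes answers:", round(lam_hat, 4))
print("pi*P + (1 - pi)*(1 - P)           :", pi * P + (1 - pi) * (1 - P))
# Solving lambda = pi*P + (1 - pi)*(1 - P) for pi recovers the prevalence,
# anticipating the estimate derived next in the notes.
print("recovered prevalence              :", round((lam_hat + P - 1) / (2 * P - 1), 4))
```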
If we use this procedure on n respondents and observe x yes answers, then the observed proportion x/n is a natural estimate of πP + (1 − π)(1 − P), i.e.

λ̂ = x/n estimates πP + (1 − π)(1 − P)

Since we know P we can solve for π, giving us the estimate

π̂ = (λ̂ + P − 1) / (2P − 1)

Reference: Encyclopedia of Biostatistics.

2.10.2 Screening

As another simple application of probability consider the following situation. We have a fixed amount of money available to test individuals for the presence of a disease, say $1,000. The cost of testing one sample of blood is $5. We have to test a population of size 1,000 in which we suspect the prevalence of the disease is 3/1,000. Can we do it?

If we divide the population into 100 groups of size 10, then there should be 1 diseased individual in each of 3 of the groups and the remaining 97 groups will be disease free. If we pool the samples from each group and test each grouped sample we would need 100 + 30 = 130 tests instead of 1,000 tests to screen everyone.

The probabilistic version is as follows: A large number N of individuals are subject to a blood test which can be administered in one of two ways:

(i) Each individual is tested separately, so that N tests are required.

(ii) The samples of n individuals are pooled (combined) and the pooled sample is tested. If this test is negative, one test suffices to clear all n individuals. If this test is positive, each of the n individuals in that group must then be tested, so n + 1 tests are required when the pooled sample tests positive.

Assume that individuals are independent and that each has probability p of testing positive. Clearly we have a Bernoulli trial model, and hence the probability that the combined sample will test positive is

P(combined test positive) = 1 − P(combined test negative) = 1 − (1 − p)^n

Thus we have for any group of size n

P(1 test) = (1 − p)^n ; P(n + 1 tests) = 1 − (1 − p)^n

It follows that the expected number of tests per group if we combine samples is

(1 − p)^n + (n + 1)[1 − (1 − p)^n] = n + 1 − n(1 − p)^n

Thus if there are N/n groups we expect to run

N [1 + 1/n − (1 − p)^n]

tests if we combine samples, instead of the N tests required to test each individual. Given a value of p we can choose n to minimize the expected total number of tests.

As an example, with N = 1,000 and p = .01 we have the following numbers:

Group Size   Expected Number of Tests
 2           519.9
 3           363.0343
 4           289.404
 5           249.0099
 6           225.1865
 7           210.7918
 8           202.2553
 9           197.5939
10           195.6179
11           195.5708
12           196.9485
13           199.4021
14           202.6828
15           206.6083
16           211.0422
17           215.8803
18           221.0418
19           226.463
20           232.0931

Thus we should combine individuals into groups of size 10 or 11, in which case we expect to run about 196 tests instead of 1,000 tests. Clearly we achieve real savings.

Reference: Feller, W. (1950). An Introduction to Probability Theory and Its Applications. John Wiley & Sons.

Figure 2.1: Graph of expected number of tests vs group size (N = 1,000 and p = .01).

Chapter 3 Probability Distributions

3.1 Random Variables and Distributions

3.1.1 Introduction

Most of the responses we model in statistics are numerical, so it is useful to have a notation for real valued responses. Real valued responses are called random variables. The notation is not only convenient, it is imperative when we consider statistics, defined as functions of sample data. The probability models for these random variables are called their sampling distributions and form the foundation of the modern theory of statistics.
Definition: • Before the experiment is performed the possible numerical response is denoted by X, X is called a random variable. • After the experiment is performed the observed value of X is denoted by x. We call x the realized or observed value of X. 99 100 CHAPTER 3. PROBABILITY DISTRIBUTIONS Notation: • The set of all possible values of a random variable X is called the sample space of X and is denoted by X . • The probability model of X is denoted by PX and we write PX (B) = P (X ∈ B) for the probability that the event X ∈ B occurs. • The probability model for X is called the probability distribution of X. There are two types of random variables which are of particular importance: discrete and continuous. These correspond to the two types of numbers introduced in the overview section and the two types of probability density functions introduced in the probability section. • A random variable is discrete if its possible values (sample space) constitute a finite or countable set e.g. X = {0, 1} ; X = {0, 1, 2, . . . , n} ; X = {0, 1, 2, . . .} ◦ Discrete random variables arise when we consider response variables which are categorical or counts. • A random variable is continuous or numeric if its possible values (sample space) is an interval of real numbers e.g. X = [0, ∞) ; X = (−∞, ∞) ◦ Continuous random variables arise when we consider response variables which are recorded on interval or ratio scales. 3.1. RANDOM VARIABLES AND DISTRIBUTIONS 3.1.2 101 Discrete Random Variables Probabilities for discrete random variables are specified by the probability density function p(x) : X PX (B) = P (X ∈ B) = p(x) x∈B Probability density functions for discrete random variables have the properties • 0 ≤ p(x) ≤ 1 for all x in the sample space X • P x∈X p(x) = 1 Binomial Distribution A random variable is said to have a binomial distribution if its probability density function is of the form: à ! n x p (1 − p)n−x for x = 0, 1, 2, . . . , n p(x) = x where 0 ≤ p ≤ 1. If we define X as the number of successes in n Bernoulli trials then X is a random variable with a binomial distribution. The parameters are n and p where p is the probability of success on a given trial. The term distribution is used because the formula describes how to distribute probability over the possible values of x. Recall that the assumptions necessary for a Bernoulli trial model to apply are: • The result of the experiment or study consists of the result of n smaller experiments called trials each of which has only two possible outcomes e.g. (dead, alive), (diseased, non-diseased), (success, failure). • The outcomes of the trials are independent. • The probabilities of the outcomes of the trials remain the same from trial to trial (homogeneous probabilities). 102 CHAPTER 3. PROBABILITY DISTRIBUTIONS Histograms of Binomial Distributions Figure 3.1: Note that as n ↑ the binomial distribution becomes more symmetric. 3.1. RANDOM VARIABLES AND DISTRIBUTIONS 103 Poisson Distribution A random variable is said to have a Poisson distribution if its probability distribution is given by λx e−λ p(x) = for x = 0, 1, 2, . . . x! • The parameter of the Poisson distribution is λ. • The Poisson distribution is one of the most important distributions in the applications of statistics to public health problems. The reasons are: ◦ It is ideally suited for modelling the occurence of “rare events”. ◦ It is also particularly useful in modelling situations involving person-time. 
◦ Specific examples of situations in which the Poisson distribution applies include: ◦ Number of deaths due to a rare disease ◦ Spatial distribution of bacteria ◦ Accidents The Poisson distribution is also useful in modelling the occurence of events over time. Suppose that we are interested in modelling a process where: (1) The occurrences of the event in an interval of time are independent. (2) The probability of a single occurrence of the event in an interval of time is proportional to the length of the interval. (3) In any extremely short time interval, the probability of more than one occurrence of the event is approximately zero. Under these assumptions: • The distribution of the random variable X, defined as the number of occurrences of the event in the interval is given by the Poisson distribution. • The parameter λ in this case is the average number of occurrences of the event in the interval i.e. λ = µt where µ is the rate per unit time 104 CHAPTER 3. PROBABILITY DISTRIBUTIONS example: Suppose that the suicide rate in a large city is 2 per week. Then the probability of two suicides in one week is P (2 suicides in one week) = 22 e−2 = .2707 = .271 2! The probability of two suicides in three weeks is P (2 suicides in three weeks) = 62 e−6 = .0446 = .045 2! example: The Poisson distribution is often used as a model for the probability of automobile or other accidents for the following reasons: (1) The population exposed is large. (2) The number of people involved in accidents is small. (3) The risk for each person is small. (4) Accidents are “random”. (5) The probability of being in two or more accidents in a short time period is approximately zero. 3.1. RANDOM VARIABLES AND DISTRIBUTIONS 105 Approximations using the Poisson Distribution Poisson probabilities can be used to approximate binomial probabilities when n is large, p is small and λ is taken to be np Thus for n = 150 and p = .02 we have the following table: x 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Binomial n = 150, p = .02 0.04830 0.14784 0.22478 0.22631 0.16974 0.10115 0.04989 0.02094 0.00764 0.00246 0.00071 0.00018 0.00004 0.00001 0.00000 Poisson λ = 150(.02) = 3 0.04979 0.14936 0.22404 0.22404 0.16803 0.10082 0.05041 0.02160 0.00810 0.00270 0.00081 0.00022 0.00006 0.00001 0.00000 Note the closeness of the approximation. The supplementary notes contain a “proof” of the propositition that the Poisson approximates the binomial when n is large and p is small. 106 CHAPTER 3. PROBABILITY DISTRIBUTIONS Histograms of Poisson Distributions Figure 3.2: Note that as n ↑ the Poisson distribution becomes more symmetric. 3.1. RANDOM VARIABLES AND DISTRIBUTIONS 3.1.3 107 Continuous or Numeric Random Variables Probabilities for numeric or continuous random variables are given by the area under the curve of its probability density function f (x). Z P (E) = E f (x)dx • f (x) has the properties: ◦ f (x) ≥ 0 ◦ The total area under the curve is one • Probabilities for numeric random variables are tabled or can be calculated using a statistical software package. The Normal Distribution By far the most important continuous probability distribution is the normal or Gaussian. The probability density function is given by: ( (x − µ)2 1 p(x) = √ exp − 2σ 2 2πσ ) • The normal distribution is used as a basic model when theobserved data has a histogram which is symmetric and bell-shaped. • In addition the normal distribution provides useful approximations to other distributions by the Central Limit Theorem. 
• The Central Limit Theorem also implies that a variety of statistics have distributions that can be approximated by normal distributions. • Most statistical methods were originally developed for the normal distribution and then extended to other distributions. • The parameter µ is the natural center of the distribution (since the distribution is symmetric about µ). 108 CHAPTER 3. PROBABILITY DISTRIBUTIONS • The parameter σ 2 or σ provides a measure of spread or scale. • The special case where µ = 0 and σ 2 = 1 is called the standard normal or Z distribution 3.1. RANDOM VARIABLES AND DISTRIBUTIONS 109 The following quote indicates the importance of the normal distribution: The normal law of error stands out in the experience of mankind as one of the broadest generalizations of natural philosophy. It serves as the guiding instrument in researches in the physical and social sciences and in medicine, agriculture and engineering. It is an indispensible tool for the analysis and the interpretation of the basic data obtained by observation and experimentation. W. J. Youden 110 CHAPTER 3. PROBABILITY DISTRIBUTIONS The principal characteristics of the normal distribution are • The curve is bell-shaped. • The possible values for x are between −∞ and +∞ • The distribution is symmetric about µ • median = mode (point of maximum height of the curve) • area under the curve is 1. • area under the curve over an interval I gives the probability of I • 68% of the probability is between µ − σ and µ + σ • 95% of the probability is between µ − 2σ and µ + 2σ • 99.7% of the probability is between µ − 3σ and µ + 3σ • For the standard normal distribution we have ◦ P (Z ≥ z) = 1 − P (Z ≤ z) ◦ P (Z ≥ z0 ) = P (Z ≤ −z0 ) for z0 ≥ 0. Thus we have P (Z ≤ 1.645) = .95 P (Z ≥ 1.645) = .05 P (Z ≤ −1.645) = .05 • Probabilities for any normal distribution can be calculated by converting to the standard normal distribution (µ = 0 and σ = 1) as follows: µ P (X ≤ x) = P Z ≤ x−µ σ ¶ 3.1. RANDOM VARIABLES AND DISTRIBUTIONS Plot of Z Distribution Figure 3.3: 111 112 CHAPTER 3. PROBABILITY DISTRIBUTIONS Plots of Normal Distributions Figure 3.4: 3.1. RANDOM VARIABLES AND DISTRIBUTIONS 113 Approximating Binomial Probabilities Using the Normal Distribution If n is large we may approximate binomial probabilities using the normal distribution as follows: 1 x − np + 2 P (X ≤ x) ≈ P Z ≤ q np(1 − p) • The 12 in the approximation is called a continuity correction since it improves the approximation for modest values of n. • A guideline is to use the normal approximation when à p n≥9 1−p ! à 1−p and n ≥ 9 p ! and use the continuity correction. The Supplementary Notes give a brief discussion of the appropriateness of the continuity correction. 114 CHAPTER 3. PROBABILITY DISTRIBUTIONS For the Binomial distribution with n = 30 and p = .3 we find the following probabilities: x P (X = x) P (X ≤ x) 0 0.00002 0.00002 1 0.00029 0.00031 2 0.00180 0.00211 3 0.00720 0.00932 4 0.02084 0.03015 5 0.04644 0.07659 6 0.08293 0.15952 7 0.12185 0.28138 8 0.15014 0.43152 9 0.15729 0.58881 10 0.14156 0.73037 11 0.11031 0.84068 12 0.07485 0.91553 13 0.04442 0.95995 14 0.02312 0.98306 15 0.01057 0.99363 16 0.00425 0.99788 17 0.00150 0.99937 Thus P (Y ≤ 12) is exactly 0.91553. Using the normal approximation without the continuity correction yields a value of 0.88400 Using the continuity correction yields a value of 0.91841, close enough for most work. However, using STATA or other statistical packages makes it easy to get exact probabilities. 3.1. 
RANDOM VARIABLES AND DISTRIBUTIONS 115 Approximating Poisson Probabilities Using the Normal Distribution If λ ≥ 10 we can use the normal (Z) distribution to approximate the Poisson distribution as follows: à ! x−λ P (X ≤ x) ≈ P Z ≤ √ λ The following are some Poisson probabilities for λ = 10 x 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 P (X = x) 0.00005 0.00045 0.00227 0.00757 0.01892 0.03783 0.06306 0.09008 0.11260 0.12511 0.12511 0.11374 0.09478 0.07291 0.05208 0.03472 0.02170 0.01276 .0070911 .0037322 .0018661 P (X ≤ x) 0.00005 0.00050 0.00277 0.01034 0.02925 0.06709 0.13014 0.22022 0.33282 0.45793 0.58304 0.69678 0.79156 0.86446 0.91654 0.95126 0.97296 0.98572 0.99281 0.99655 0.99841 For y = 15 we find that P (≤ 15) = 0.95126 Using the normal approximation yields a value of 0.94308 A continuity correction can again be used to improve the approximation. 116 3.1.4 CHAPTER 3. PROBABILITY DISTRIBUTIONS Distribution Functions For any random variable the probability that it assumes a value less than or equal to a specified value, say x, is called its distribution function and denoted by F i.e. F (x) = P (X ≤ x) The distribution function F is between 0 and 1 and does not decrease as x increases. The graph of F is a step function for discrete random variables (the height of the step at x is the probability of the value x) and is a differentiable function for continuous random varaibles (the derivative equals the density function). Distribution functions are the model analogue to the empirical distribution function introduced in the exploratory data analysis section. They play an important role in goodness of fit tests and in finding the distribution of functions of continuous random variables. In addition, the natural estimate of the distribution function is the empirical distribution function which forms the basis for the substitution method of estimation. 3.1. RANDOM VARIABLES AND DISTRIBUTIONS 3.1.5 117 Functions of Random Variables It is often necessary to find the distribution of a function of a random variable(s). Functions of Discrete Random Variables In this case to find the pdf of Y = g(X) we find the probability density function directly using the formula f (y) = P (Y = y) = P ({x : g(x) = y}) Thus if X has a binomial pdf with parameters n and p and represents the number of successes in n trials what is the pdf of Y = n − X, the number of failures? We find that à ! à ! n n P (Y = y) = P ({x : x = n − y}) = pn−y (1 − p)n−(n−y) = (1 − p)y pn−y n−y y i.e. binomial with parameters n and 1 − p. 118 CHAPTER 3. PROBABILITY DISTRIBUTIONS Functions of Continuous Random Variables Here we find the distribution function of Y P (Y ≤ y) = P ({x : g(x) ≤ y}) and then differentiate to find the density function of Y . example: Let Z be standard normal and let Y = Z 2 . The distribution function of Y is given by Z √y √ √ F (y) = P (Y ≤ y) = P ({z : − y ≤ z ≤ y}) = √ φ(z)dz − y where φ(z) is the standard normal density i.e. φ(z) = (2π)−1/2 e−z 2 /2 It follows that the density function of Y is equal to dF (y) 1 1 √ √ = √ φ( y) + √ φ(− y) dy 2 y 2 y or 1 √ y 1/2−1 e−y/2 √ f (y) = √ ( 2π)−1/2 e−y/2 = y 21/2 π which is called the chi-square distribution with one degree of freedom. That is, if Z is standard normal then Z 2 is chi-square with one degree of freedom. 3.1.6 Other Distributions A variety of other distributions arise in statistical problems. These include the log-normal, the chi-square, the Gamma, the Beta, the t, the F , and the negative binomial. We will discuss these as they arise. 
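As a quick check on the result above that Z² has the chi-square distribution with one degree of freedom, the following Python sketch (an addition, not part of the original notes; the sample size and the y values are illustrative) simulates squared standard normals and compares their empirical distribution function with P(Z² ≤ y) = P(−√y ≤ Z ≤ √y), which for the standard normal equals erf(√(y/2)).

```python
import math
import random

rng = random.Random(1)
n = 200_000
z2 = [rng.gauss(0, 1) ** 2 for _ in range(n)]   # squares of standard normal draws

# P(Z^2 <= y) = P(-sqrt(y) <= Z <= sqrt(y)), which for the standard normal
# equals erf(sqrt(y/2)) -- the chi-square(1) distribution function.
for y in (0.5, 1.0, 2.0, 3.84):
    empirical = sum(v <= y for v in z2) / n
    theoretical = math.erf(math.sqrt(y / 2))
    print(f"y = {y:5.2f}   simulated {empirical:.4f}   chi-square(1) {theoretical:.4f}")
```

At y = 3.84, the familiar 95% point of the chi-square(1) distribution, both columns come out near 0.95.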
3.2. PARAMETERS OF DISTRIBUTIONS 3.2 119 Parameters of Distributions 3.2.1 Expected Values In exploratory data analysis we emphasized the importance of a measure of location (center) and spread (variability) for a batch of numbers. There are analagous measures for probability distributions. Definition: The expected value, E(X), of a random variable is the weighted average of its values, the weights being the probability assigned to the values. • For a discrete random variable we have E(X) = X xp(x) x where p(x) is the probability density function of X. • For continuous random variables Z E(X) = x xf (x)dx Some important expected values are: (1) The expected value of the binomial distribution is np (2) The expected value of the Poisson distribution is λ (3) The expected value of the normal distribution is µ 120 CHAPTER 3. PROBABILITY DISTRIBUTIONS Using the properties of sums and integrals we have the following properties of expected values • E(c) = where c is a constant. In words: The expected value of a constant is equal to the constant. • E(cX) = cE(X) where c is a constant. In words: The expected value of a constant times a random variable is equal to the constant times the expected value of the random variable. • E(X + Y ) = E(X) + E(Y ) In words: The expected value of the sum of two random variables is the sum of their expected values. • If X ≥ 0 then E(X) ≥ 0 In words: The expected value of a non-negative random variable is nonnegative. Note: The result that the expected value of the sum of two random variables is the sum of their expected values is non trivial in the sense that one must show that the distribution of the sum has expected value equal to the sum of the individual expected values. 3.2. PARAMETERS OF DISTRIBUTIONS 3.2.2 121 Variances Definition: The variance of a random variable is var (X) = E(X − µ)2 where µ = E(X) • If we write X = µ + (X − µ) or X = µ + error we see that the variance of a random variable is a measure of the average size of the squared error made when using µ to predict the value of X. • The square root of var (X) is called the standard deviation of X and is used as a basic measure of variability for X. (1) For the binomial var (X) = npq where q = 1 − p (2) For the Poisson var (X) = λ (3) For the normal var (X) = σ 2 Using the properties of sums and integrals we have the following properties of variances: • var (c) = 0 where c is a constant. In words: The variance (variability) of a constant is 0. • var (c + X) = var (X) where c is a constant. In words: The variance of a random variable is unchanged by the addition of a constant. • var (cX) = c2 var (X) where c is a constant. In words: The variance of a constant times a random variable equals the constant squared times the variance of the random variable. • var (X) ≥ 0 In words: The variance of a random variable cannot be negative. 122 3.2.3 CHAPTER 3. PROBABILITY DISTRIBUTIONS Quantiles Recall that • The median of a batch of numbers is the value which divides the batch in half. • Similarly the upper quartile has one fourth of the numbers above it while the lower quartile has one fourth of the numbers below it. • There are analogs for probability distributions of random variables. Definition: The pth quantile, Qp of X is defined by P (X ≤ Qp ) = p where 0 < p < 1. • Q.5 is called the median of X • Q.25 is called the lower quartile of X • Q.75 is called the upper quartile of X • Q.75 − Q.25 is called the interquartile range of X 3.2. 
PARAMETERS OF DISTRIBUTIONS 3.2.4 123 Other Expected Values If Y = g(X) is a function of X then Y is also a random variable and has expected value given by ( P g(x)f (x) if X is discrete E[Y ] = E[g(X)] = R x x g(x)f (x)dx if X is continuous Definition: The moment generating function of X, M (t), is defined as the expected value of Y = etX where t is a real number. The moment generating function has two important theoretical properties: (1) The rth derivative of M (t) with respect to t, evaluated at t = 0 gives the rth moment of X, E(X r ) for any integer r. This often provides an easy method to find the mean, variance, etc. of a random variable. (2) The moment generating function is unique: that is, if two distributions have the same moment generating function then they have the same distribution. example: For the binomial distribution we have that tX M (t) = E[e ] = n X x=0 à ! e tx à ! n X n x n n−x p (1 − p) = (pet )x (1 − p)n−x = (pet + q)n x x=0 x where q = 1 − p. The first and second derivatives are d2 M (t) dt2 dM (t) = npet (pet + q)n−1 dt 2 2t t n−2 = n(n − 1)p e (pe + q) + npet (pet + q)n−1 Thus we have E(X) = np ; E(X 2 ) = n(n − 1)p2 + np and hence var (X) = n(n − 1)p2 + np − (np)2 = np(1 − p) example: For the Poisson distribution we have that M (t) = E(etX ) = ∞ X x=0 etx e−λ ∞ X λx (λet )x t == e−λ = eλ(e −1) x! x! x=0 124 CHAPTER 3. PROBABILITY DISTRIBUTIONS The first and second derivatives are dM (t) dt 2 M (t) = λet M (t) d dt = λ2 et M (t) + λet M (t) 2 Thus we have E(X) = λ ; E(X 2 ) = λ2 + λ and hence var (X) = (λ2 + λ) − λ2 = λ example: For the normal distribution we have that M (t) = exp{tµ + t2 σ 2 /2} The first two derivatives are dM (t) dt 2 M (t) = (µ + tσ 2 )M (t) d dt = (µ + tσ 2 )2 M (t) + (σ 2 )M (t) 2 Thus we have E(X) = µ ; E(X 2 ) = µ2 + σ 2 and hence var (X) = (µ2 + σ 2 ) − µ2 = σ 2 3.2. PARAMETERS OF DISTRIBUTIONS 3.2.5 125 Inequalities involving Expectations Markov’s Inequality: If Y is any non-negative random variable then P (Y ≥ c) ≤ E(Y ) c where c is any positive constant. To see this define a discrete random variable by the equation ( Z= c if Y ≥ c 0 if Y < c Note that Z ≤ Y so that E(Y ) ≥ E(Z) = 0P (Z = 0) + cP (Z = c) = cP (Y ≥ c) Tchebychev’s Inequality: If X is any random variable then P (−δ < X − µ < δ) ≥ 1 − σ2 δ2 where σ 2 is the variance of X and δ is any positive number. To see this define Y = (|X − µ|)2 Then Y is non-negative with expected value equal to σ 2 and by Markov’s Inequality we have that σ2 P (Y ≥ δ 2 ) ≤ 2 δ and hence σ2 σ2 1 − P (Y < δ 2 ) ≤ 2 or P (Y < δ 2 ) ≥ 1 − 2 δ δ But P (Y < δ 2 ) = P (|X − µ| < δ) = P (−δ < |X − µ| < δ) so that P (−δ < X − µ < δ) ≥ 1 − σ2 δ2 126 CHAPTER 3. PROBABILITY DISTRIBUTIONS example: Consider n Bernoulli trials and let Sn be the number of successes. Then X = Sn /n has µ ¶ µ ¶ Sn np Sn npq pq E = = p and var = 2 = n n n n n Thus Tchebychev’s Inequality says that µ 1 ≥ P −δ < ¶ Sn pq −p<δ ≥1− 2 n nδ In other words, if the number of trials is large, the probability that the observed frequency of successes will be close to the true probability of success is close to 1. This is used as the justification for the relative frequency interpretation of probability. It is also a special case of the Weak Law of large Numbers. Chapter 4 Joint Probability Distributions 4.1 General Case Often we want to consider several responses simultaneously. We model these using random variables X1 , X2 , . . . and we have joint probability distributions. There are again two major types. 
(i) Joint discrete distributions have the property that the sample space for each random variable is discrete and probabilities are assigned using the joint probability density function defined by 0 ≤ f (x1 , x2 , . . . , xk ) ≤ 1 ; XX x1 x2 ··· X f (x1 , x2 , . . . , xk ) = 1 xk (ii) Joint continuous distributions have the property that the sample space for each random variable is continuous and probabilities are assigned using the probability density function which has the properties that Z f (x1 , x2 , . . . , xk ) ≥ 0 x1 Z x2 Z ··· xk 127 f (x1 , x2 , . . . , xk )dx1 dx2 · · · dxk = 1 128 4.1.1 CHAPTER 4. JOINT PROBABILITY DISTRIBUTIONS Marginal Distributions Marginal distributions are distributions of subsets of random variables which have a joint distribution. In particular the marginal distribution of one of the components, say Xi , is said to be the marginal distribution of Xi . Marginal distributions are obtained by “summing” or “integrating” out the other variables in the joint density. Thus if X and Y have a joint distribution which is discrete the marginal distribution of X is given by fX (x) = X f (x, y) y If X and Y have a joint distribution which is continuous the marginal distribution of X is given by Z fX (x) = f (x, y)dy y 4.1.2 Conditional Distributions Conditional distributions are distributions of subsets of random variables which have a joint distribution given that other components of the random variables are fixed. The conditional distribution of Y given X = x is obtained by fY |X (y|x) = f (y, x) fX (x) where f (y, x) is the joint distribution of Y and X and fX (x) is the marginal distribution of X. Conditional distributions are of fundamental importance in regression and prediction problems. 4.1. GENERAL CASE 4.1.3 129 Properties of Marginal and Conditional Distributions • The joint distribution of X1 , X2 , . . . , Xk can be obtained as f (x1 , x2 , . . . , xk ) = f1 (x1 )f2 (xx |x1 )f3 (x3 |x1 , x2 ) · · · fk (xk |x1 , x2 , . . . , xk−1 ) which is a generalization of the multiplication rule for probabilities. • The marginal distribution of Y can be obtained via the formula ( fY (y) = P x f (y|x)fX (x) if X, Y are discrete f (y|x)f X (x)dx if X, Y are continuous x R which is is a generalization of the law of total probability. • The conditional density of y given X = x can be obtained as fY |X (y|x) = fY (y)fX|Y (x|y) f (y, x) = fX (x) fX (x) which is a version of Bayes Theorem. 4.1.4 Independence and Random Sampling If X and Y have a joint distribution they are independent if f (x, y) = fX (x)fY (y) or if fY |X (y|x) = fY (y) In general X1 , X2 , . . . , Xn are independent if f (x1 , x2 , . . . , xn ) = fX1 (x1 )fX2 (x2 ) · · · fXn (xn ) i.e. the joint distribution is the product of the marginal distributions. Definition: We say that x1 , x2 , . . . , xn constitute a random sample from f if they are realized values of independent random variables X1 , X2 , . . . , Xn , each of which has the same probability distribution f . Random sampling from a distribution is fundamental to many applications of modern statistics. 130 4.2 CHAPTER 4. JOINT PROBABILITY DISTRIBUTIONS The Multinomial Distribution The most important joint discrete distribution is the multinomial defined as f (x1 , x2 , · · · , xk ) = n! k Y pxi i i=1 where xi ! P k xi = 0, 1, 2, . . . , n , i = 1, 2, . . . , k , x =n Pk i=1 i 0 ≤ pi ≤ 1 , i = 1, 2, . . . 
, k , i=1 pi = 1 The multinomial is the basis for the analysis trials where the outcomes are not binary but of k distinct types and in the analysis of tables of data which consist of counts of the number of times certain response patterns occur. Note that if k = 2 the multinomial reduces to the binomial. example: Suppose we are interested in the daily pattern of “accidents” in a manufacturing firm. Assuming individuals in the firm have accidents independent of others then the probability of accidents by day has the multinomal distribution P (x1 , x2 , x3 , x4 , x5 ) = n! px1 px2 px3 px4 px5 x1 !x2 !x3 !x4 !x5 ! 1 2 3 4 5 where pi is the probability of an accident on day i and i indexes working days. Of interest is whether or not the pi are equal. If they are not we might be interested in which seem too large. 4.2. THE MULTINOMIAL DISTRIBUTION 131 example: This data set consists of the cross classification of 12,763 applications for admission to graduate programs at the University of California at Berkeley in 1973. The data were classified by gender and admission outcome. Of interest is the possibility of gender bias in the admissions policy of the university. Gender Male Female Admissions Outcome Admitted Not Admitted 3738 4704 1494 2827 In general we have that n individuals are investigated and their gender and admission outcome is recorded. The data are thus of the form: Gender Male Female Admitted n00 n10 Not Admitted n01 n11 To model this data we assume that individuals are independent and that the possible response patters for an individual are given by one of the following: (male, admitted) = (0, 0) (female, admitted) = (1, 0) (male, not admitted) = (0, 1) ( female, not admitted) = (1, 1) Denoting the corresponding probabilities by p11 , p01 , p10 and p00 the multinomial model applies and we have the probabilities of the observed responses given by n! pn00 pn01 pn10 pn11 n00 !n01 !n10 !n11 ! 00 01 10 11 The random variables are thus N00 , N01 , N10 and N11 . 132 CHAPTER 4. JOINT PROBABILITY DISTRIBUTIONS In the model above the probabilities are thus given by Gender Male Female Marginal of Admission Status Admitted p00 p10 p+0 Not Admitted p01 p11 p+1 Marginal of Gender p0+ p1+ 1 Note that p+0 gives the probability of admission and that p0+ gives the probability of being male. It is clear (why?) that the marginal distribution of admission is binomial with parameters n and p = p+0 . The probability that N00 = n00 and N01 = n01 given that N00 + N01 = n0+ gives the probability of admission given male and is P (N00 = n00 , N01 = n01 |N00 + N01 = n0+ ) This conditional probability is given by: P (N00 = n00 , N01 = n0+ − n00 ) = P (N00 + N01 = n0+ ) n! (n0+ −n00 )!n00 !(n−n0+ )! n! n0+ !(n−n0+ )! à n −n pn0000 p010+ 00 (1 − p0+ )n−n0+ n p0+0+ (1 − p0+ )n−n0+ !n p00 00 n0+ ! = n00 !(n0+ − n00 )! p0+ à ! n0+ n00 = p (1 − p∗ )n0+ −n00 n00 ∗ à p01 p0+ !n0+ −n00 which is a binomial distribution with parameters n0+ , the number male and p00 p∗ = p0+ Note that the odds of admission given male are p∗ p00 = 1 − p∗ p01 Similarly the probability of admission given female is binomial with parameters n1+ , the number of females and P∗ where p10 P∗ = p1+ Note that the odds in this case are given by p10 P∗ = 1 − P∗ p11 4.2. THE MULTINOMIAL DISTRIBUTION 133 Thus the odds ratio of admission (female to male) is given by p10 /p11 p01 p10 = p00 /p01 p00 p11 If the odds ratio is one gender and admission are independent. (Why?) 
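The conditional-binomial argument just given is easy to see numerically. In the Python sketch below (not part of the original notes) the four cell probabilities are hypothetical values chosen for illustration, not the Berkeley data; the simulation checks that, among the simulated males, the proportion admitted settles near p00/p0+, and it computes the corresponding odds ratio directly from the cell probabilities.

```python
import random
from collections import Counter

# Hypothetical cell probabilities for (gender, outcome) -- illustrative only.
p = {("male", "admit"): 0.30, ("male", "reject"): 0.36,
     ("female", "admit"): 0.12, ("female", "reject"): 0.22}

rng = random.Random(2)
cells = list(p)
n = 200_000
counts = Counter(rng.choices(cells, weights=[p[c] for c in cells], k=n))

# Conditional on the number of males, the count of male admits behaves like a
# binomial with success probability p00 / p0+; compare the conditional proportion.
n_male = counts[("male", "admit")] + counts[("male", "reject")]
p_star = p[("male", "admit")] / (p[("male", "admit")] + p[("male", "reject")])
print("simulated P(admit | male):", round(counts[("male", "admit")] / n_male, 4))
print("p00 / p0+                :", round(p_star, 4))

# Odds ratio of admission (female to male) computed from the cell probabilities.
odds_female = p[("female", "admit")] / p[("female", "reject")]
odds_male = p[("male", "admit")] / p[("male", "reject")]
print("odds ratio (female:male) :", round(odds_female / odds_male, 3))
```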
It follows that the odds ratio is a natural measure of association for categorical data. In the example the odds of admission given male are estimated by
\[
\text{odds of admission given male} = \frac{3738/8442}{4704/8442} = \frac{3738}{4704} = 0.79
\]
while the odds of admission given female are
\[
\text{odds of admission given female} = \frac{1494/4321}{2827/4321} = \frac{1494}{2827} = 0.53
\]
Thus the odds of admission are lower for females. The odds ratio is estimated by
\[
\text{odds ratio of admission (female to male)} = \frac{1494/2827}{3738/4704} = \frac{1494 \times 4704}{2827 \times 3738} = 0.67
\]
Is this odds ratio different enough from 1 to claim that females are discriminated against in the admissions policy? More later!!!

4.3 The Multivariate Normal Distribution

The most important joint continuous distribution is the multivariate normal distribution. The density function of $\mathbf{X}$ is given by
\[
f(\mathbf{x}) = (2\pi)^{-k/2}\, [\det(\mathbf{V})]^{-1/2} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}
\]
where we assume that $\mathbf{V}$ is a symmetric, positive definite (hence non-singular) matrix of rank $k$. The two parameters of this distribution are $\boldsymbol{\mu}$ and $\mathbf{V}$.

• It can be shown that the marginal distribution of any $X_i$ is normal with parameters $\mu_i$ and $v_{ii}$, where
\[
\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{bmatrix}, \qquad
\mathbf{V} = \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1k} \\ v_{12} & v_{22} & \cdots & v_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ v_{1k} & v_{2k} & \cdots & v_{kk} \end{bmatrix}
\]

• It can also be shown that linear combinations of multivariate normal random variables are again multivariate normal. More precisely, let $\mathbf{W} = \mathbf{a} + \mathbf{B}\mathbf{X}$ where $\mathbf{a}$ is $p \times 1$ and $\mathbf{B}$ is a $p \times k$ matrix with $p \le k$. Then the joint distribution of $\mathbf{W}$ is multivariate normal with parameters
\[
\boldsymbol{\mu}_W = \mathbf{a} + \mathbf{B}\boldsymbol{\mu}_X \quad \text{and} \quad \mathbf{V}_W = \mathbf{B}\mathbf{V}_X\mathbf{B}^T
\]
where $\mathbf{B}^T$ is the transpose of $\mathbf{B}$.

• It can also be shown that the conditional distribution of any subset of $\mathbf{X}$ given any other subset is multivariate normal. More precisely, let
\[
\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}; \qquad
\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}; \qquad
\mathbf{V} = \begin{bmatrix} \mathbf{V}_{11} & \mathbf{V}_{12} \\ \mathbf{V}_{12}^T & \mathbf{V}_{22} \end{bmatrix}
\]
where $\mathbf{A}^T$ denotes the transpose of $\mathbf{A}$. Then the conditional distribution of $\mathbf{X}_2$ given $\mathbf{X}_1 = \mathbf{x}_1$ is also multivariate normal with
\[
\boldsymbol{\mu}_* = \boldsymbol{\mu}_2 + \mathbf{V}_{12}^T \mathbf{V}_{11}^{-1} (\mathbf{x}_1 - \boldsymbol{\mu}_1); \qquad
\mathbf{V}_* = \mathbf{V}_{22} - \mathbf{V}_{12}^T \mathbf{V}_{11}^{-1} \mathbf{V}_{12}
\]

• It follows that if $\mathbf{X}_1$ and $\mathbf{X}_2$ have a multivariate normal distribution then they are independent if and only if $\mathbf{V}_{12} = \mathbf{0}$.

The multivariate normal distribution forms the basis for regression analysis, analysis of variance and a variety of other statistical methods including factor analysis and latent variable analysis.

4.4 Parameters of Joint Distributions

4.4.1 Means, Variances, Covariances and Correlation

The collection of expected values of the marginal distributions of $\mathbf{Y}$ is called the expected value of $\mathbf{Y}$ and is written as
\[
E(\mathbf{Y}) = \boldsymbol{\mu} = \begin{bmatrix} E(Y_1) \\ E(Y_2) \\ \vdots \\ E(Y_k) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{bmatrix}
\]
The covariance between $X$ and $Y$, where $X$ and $Y$ have a joint distribution, is defined by
\[
\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]
\]
The correlation between $X$ and $Y$ is defined as
\[
\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}
\]
and is simply a standardized covariance. Correlations have the property that $-1 \le \rho(X, Y) \le 1$.

Using the properties of expected values we see that covariances have the following properties:

• $\mathrm{cov}(X, Y) = \mathrm{cov}(Y, X)$
• $\mathrm{cov}(X, X) = \mathrm{var}(X)$
• $\mathrm{cov}(X + a, Y + b) = \mathrm{cov}(X, Y)$
• $\mathrm{cov}(aX, bY) = ab\,\mathrm{cov}(X, Y)$
• $\mathrm{cov}(aX + bY, cW + dZ) = ac\,\mathrm{cov}(X, W) + ad\,\mathrm{cov}(X, Z) + bc\,\mathrm{cov}(Y, W) + bd\,\mathrm{cov}(Y, Z)$

We define the variance covariance matrix of $\mathbf{Y}$ as
\[
\mathbf{V}_Y = \begin{bmatrix}
\mathrm{var}(Y_1) & \mathrm{cov}(Y_1, Y_2) & \cdots & \mathrm{cov}(Y_1, Y_k) \\
\mathrm{cov}(Y_2, Y_1) & \mathrm{var}(Y_2) & \cdots & \mathrm{cov}(Y_2, Y_k) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}(Y_k, Y_1) & \mathrm{cov}(Y_k, Y_2) & \cdots & \mathrm{var}(Y_k)
\end{bmatrix}
\]
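The covariance properties and the variance covariance matrix can be checked by simulation. A small Python sketch (numpy assumed; the simulated variables are illustrative, not data from these notes):

import numpy as np

rng = np.random.default_rng(0)

# Estimate a variance-covariance matrix from simulated data and check
# numerically that cov(aX + c, bY + d) = ab cov(X, Y).
n = 100_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)         # correlated with x by construction
z = rng.normal(size=n)                   # independent of both

V = np.cov(np.vstack([x, y, z]))         # diagonal = variances, off-diagonal = covariances
print(np.round(V, 2))

a, b = 2.0, -3.0
lhs = np.cov(a * x + 1.0, b * y - 5.0)[0, 1]   # shifting changes nothing, scaling multiplies
rhs = a * b * np.cov(x, y)[0, 1]
print(np.round(lhs, 3), np.round(rhs, 3))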
Note that for the multivariate normal distribution with parameters $\boldsymbol{\mu}$ and $\mathbf{V}$ we have
\[
E(\mathbf{Y}) = \boldsymbol{\mu} \quad \text{and} \quad \mathbf{V}_Y = \mathbf{V}
\]
Thus the two parameters of the multivariate normal are respectively the mean vector and the variance covariance matrix.

4.4.2 Joint Moment Generating Functions

The joint moment generating function of $X_1, X_2, \ldots, X_k$ is defined as
\[
M_X(\mathbf{t}) = E\left( e^{\sum_{i=1}^k t_i X_i} \right)
\]

• Partial derivatives with respect to $t_i$ evaluated at $t_1 = t_2 = \cdots = t_k = 0$ give the moments of $X_i$, and mixed partial derivatives (e.g. with respect to $t_i$ and $t_j$) give the covariances, etc.

• Joint moment generating functions are unique (if two distributions have the same moment generating function then the two distributions are the same).

• The joint moment generating function for the multivariate normal distribution is given by
\[
M_X(\mathbf{t}) = \exp\left\{ \boldsymbol{\mu}^T \mathbf{t} + \frac{1}{2} \mathbf{t}^T \mathbf{V} \mathbf{t} \right\} = \exp\left\{ \sum_{i=1}^k t_i \mu_i + \frac{1}{2} \sum_{i=1}^k \sum_{j=1}^k t_i t_j v_{ij} \right\}
\]

• If random variables are independent then their joint moment generating function is equal to the product of the individual moment generating functions.

4.5 Functions of Jointly Distributed Random Variables

If $Y = g(\mathbf{X})$ is any function of the random variables $\mathbf{X}$ we can find its distribution exactly as in the one variable case, i.e.
\[
f_Y(y) = \sum_{\mathbf{x}:\, g(\mathbf{x}) = y} f(x_1, x_2, \ldots, x_k) \quad \text{if } \mathbf{X} \text{ is discrete}
\]
\[
f_Y(y) = \frac{dF_Y(y)}{dy} \quad \text{if } \mathbf{X} \text{ is continuous, where} \quad F_Y(y) = \int_{\{\mathbf{x}:\, g(\mathbf{x}) \le y\}} f(x_1, x_2, \ldots, x_k)\,dx_1\,dx_2 \cdots dx_k
\]
Thus we can find the distribution of the sum, the difference, a linear combination, a ratio, a product, etc. We shall not derive all of the results we use in later sections, but we record a few of the most important here.

• If $\mathbf{X}$ has a multivariate normal distribution with mean $\boldsymbol{\mu}$ and variance covariance matrix $\mathbf{V}$ then the distribution of
\[
Y = a + \mathbf{b}^T \mathbf{X} = a + \sum_{i=1}^k b_i X_i
\]
is normal with
\[
E(Y) = a + \mathbf{b}^T \boldsymbol{\mu} = a + \sum_{i=1}^k b_i E(X_i) \quad \text{and} \quad \mathrm{var}(Y) = \mathbf{b}^T \mathbf{V} \mathbf{b} = \sum_{i=1}^k \sum_{j=1}^k b_i b_j\, \mathrm{cov}(X_i, X_j)
\]

• If $Z_1, Z_2, \ldots, Z_r$ are independent, each $N(0, 1)$, then the distribution of $Z_1^2 + Z_2^2 + \cdots + Z_r^2$ is chi-square with $r$ degrees of freedom.

• If $Z$ is $N(0, 1)$, $W$ is chi-square with $r$ degrees of freedom, and $Z$ and $W$ are independent, then
\[
T = \frac{Z}{\sqrt{W/r}}
\]
has a Student's t distribution with $r$ degrees of freedom.

• If $Z_1$ and $Z_2$ are each $N(0, 1)$ and independent, then the distribution of the ratio $C = Z_1/Z_2$ is Cauchy with parameters 0 and 1.

4.5.1 Linear Combinations of Random Variables

If $X_1, X_2, \ldots, X_n$ have a joint distribution with means $\mu_1, \mu_2, \ldots, \mu_n$ and variances and covariances given by $\mathrm{cov}(X_i, X_j) = v_{ij}$, then the expected value of $\sum_{i=1}^n a_i X_i$ is given by
\[
E\left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i E(X_i) = \sum_{i=1}^n a_i \mu_i
\]
and the variance of $\sum_{i=1}^n a_i X_i$ is given by
\[
\mathrm{var}\left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n \sum_{j=1}^n a_i a_j\, \mathrm{cov}(X_i, X_j) = \sum_{i=1}^n \sum_{j=1}^n a_i a_j v_{ij}
\]
If we write
\[
\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}; \qquad
\mathbf{V} = \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1n} \\ v_{21} & v_{22} & \cdots & v_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ v_{n1} & v_{n2} & \cdots & v_{nn} \end{bmatrix}
\]
these results may be written as
\[
E(\mathbf{a}^T \mathbf{X}) = \mathbf{a}^T \boldsymbol{\mu}; \qquad \mathrm{var}(\mathbf{a}^T \mathbf{X}) = \mathbf{a}^T \mathbf{V} \mathbf{a}
\]
As special cases we have

• $\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y) + 2\,\mathrm{cov}(X, Y)$
• $\mathrm{var}(X - Y) = \mathrm{var}(X) + \mathrm{var}(Y) - 2\,\mathrm{cov}(X, Y)$
• Thus if $X$ and $Y$ are uncorrelated with the same variance $\sigma^2$ we have
  – $\mathrm{var}(X + Y) = 2\sigma^2$
  – $\mathrm{var}(X - Y) = 2\sigma^2$
• More generally, if $X_1, X_2, \ldots, X_n$ are uncorrelated then
\[
\mathrm{var}\left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i^2\, \mathrm{var}(X_i)
\]
  – In particular, if we take each $a_i = \frac{1}{n}$ and each $X_i$ has variance $\sigma^2$, we have
\[
\mathrm{var}(\bar{X}) = \frac{\sigma^2}{n}
\]
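A quick simulation confirms the matrix formulas $E(\mathbf{a}^T\mathbf{X}) = \mathbf{a}^T\boldsymbol{\mu}$ and $\mathrm{var}(\mathbf{a}^T\mathbf{X}) = \mathbf{a}^T\mathbf{V}\mathbf{a}$, as well as the special case $\mathrm{var}(\bar{X}) = \sigma^2/n$. The sketch below is in Python with numpy; the mean vector, covariance matrix and weights are illustrative choices, not values from the text.

import numpy as np

rng = np.random.default_rng(1)

# Check var(a'X) = a'Va and E(a'X) = a'mu by simulation.
V = np.array([[2.0, 0.6, 0.3],
              [0.6, 1.0, 0.2],
              [0.3, 0.2, 1.5]])
mu = np.array([1.0, -2.0, 0.5])
a = np.array([0.5, -1.0, 2.0])

X = rng.multivariate_normal(mu, V, size=200_000)
combo = X @ a

print(np.round(combo.mean(), 3), np.round(a @ mu, 3))      # should be close
print(np.round(combo.var(), 3), np.round(a @ V @ a, 3))    # should be close

# Special case a_i = 1/n for uncorrelated variables: var(xbar) = sigma^2 / n
sigma2, n = 4.0, 25
xbar = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n)).mean(axis=1)
print(np.round(xbar.var(), 3), sigma2 / n)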
4.6 Approximate Means and Variances

In some problems we cannot find the expected value, variance, or distribution of $Y = g(X)$ exactly. It is useful to have approximations for the means and variances in such cases. If the function $g$ is reasonably linear in a neighborhood of $\mu_X$, the expected value of $X$, then by Taylor's Theorem we can write
\[
Y = g(X) \approx g(\mu_X) + g^{(1)}(\mu_X)(X - \mu_X)
\]
Hence we have
\[
E(Y) \approx g(\mu_X), \qquad \mathrm{var}(Y) \approx [g^{(1)}(\mu_X)]^2 \sigma_X^2
\]
We can get an improved approximation to the expected value of $Y$ by writing
\[
Y = g(X) \approx g(\mu_X) + g^{(1)}(\mu_X)(X - \mu_X) + \frac{1}{2} g^{(2)}(\mu_X)(X - \mu_X)^2
\]
Thus
\[
E(Y) \approx g(\mu_X) + \frac{1}{2} g^{(2)}(\mu_X)\, \sigma_X^2
\]
If $Z = g(X, Y)$ is a function of two random variables then we can write
\[
Z = g(X, Y) \approx g(\boldsymbol{\mu}) + \frac{\partial g(\boldsymbol{\mu})}{\partial x}(X - \mu_X) + \frac{\partial g(\boldsymbol{\mu})}{\partial y}(Y - \mu_Y)
\]
where $\boldsymbol{\mu}$ denotes the point $(\mu_X, \mu_Y)$ and
\[
\frac{\partial g(\boldsymbol{\mu})}{\partial x} = \left. \frac{\partial g(x, y)}{\partial x} \right|_{x = \mu_X,\, y = \mu_Y}
\]
Thus we have
\[
E(Z) \approx g(\boldsymbol{\mu})
\]
\[
\mathrm{var}(Z) \approx \left[ \frac{\partial g(\boldsymbol{\mu})}{\partial x} \right]^2 \sigma_X^2 + \left[ \frac{\partial g(\boldsymbol{\mu})}{\partial y} \right]^2 \sigma_Y^2 + 2 \left[ \frac{\partial g(\boldsymbol{\mu})}{\partial x} \right] \left[ \frac{\partial g(\boldsymbol{\mu})}{\partial y} \right] \mathrm{cov}(X, Y)
\]
As in the single variable case we can obtain an improved approximation for the expected value by using Taylor's Theorem with second order terms, e.g.
\[
E(Z) \approx g(\boldsymbol{\mu}) + \frac{1}{2} \frac{\partial^2 g(\boldsymbol{\mu})}{\partial x^2} \sigma_X^2 + \frac{1}{2} \frac{\partial^2 g(\boldsymbol{\mu})}{\partial y^2} \sigma_Y^2 + \frac{\partial^2 g(\boldsymbol{\mu})}{\partial x \partial y} \mathrm{cov}(X, Y)
\]

• Note 1: The improved approximation is needed for the expected value because in general $E[g(X)] \ne g(\mu)$; for example $E(X^2) \ne \mu^2$.

• Note 2: Some care is needed when working with discrete variables and certain functions. If $X$ is binomial with parameters $n$ and $p$, the expected value of $\log(X)$ is not defined (since $X = 0$ has positive probability), so no approximation can be correct.

4.7 Sampling Distributions of Statistics

Definition: A statistic is a numerical quantity calculated from a set of data. Typically a statistic is designed to provide information about some parameter of the population.

• If $x_1, x_2, \ldots, x_n$ are the data, some statistics are
  – $\bar{x}$, the sample mean
  – the median
  – the upper quartile
  – $s^2$, the sample variance
  – the range

• Since the data are realized values of random variables, a statistic is also a realized value of a random variable.

• The probability distribution of this random variable is called the sampling distribution of the statistic.

In most contemporary applications of statistics the sampling distribution of a statistic is used to assess the performance of the statistic for inference about population parameters. The following figures illustrate the concept.

Figure 4.1: Schematic diagram of the concept of the sampling distribution of a statistic.

Figure 4.2: Illustration of sampling distributions: sampling distribution of the sample mean, sample size 25.

Figure 4.3: Illustration of sampling distributions: sampling distribution of $(n-1)s^2/\sigma^2$, sample size $n = 10$.

Figure 4.4: Illustration of sampling distributions: sampling distribution of $t = \sqrt{n}(\bar{x} - \mu)/s$, sample size $n = 10$.
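Figures such as these can be reproduced approximately by simulation: draw many samples of a fixed size, compute the statistic for each, and examine the resulting collection of values. The following Python sketch (numpy and scipy assumed; the normal population parameters are illustrative) does this for $(n-1)s^2/\sigma^2$ with $n = 10$ and compares the simulated quantiles with those of the chi-square distribution with $n - 1$ degrees of freedom.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulate the sampling distribution of (n-1)s^2/sigma^2 for samples of
# size n = 10 from a normal population (cf. Figure 4.3).
mu, sigma, n, reps = 50.0, 4.0, 10, 5_000

samples = rng.normal(mu, sigma, size=(reps, n))
stat = (n - 1) * samples.var(axis=1, ddof=1) / sigma**2

# Compare simulated quantiles with chi-square(n-1) quantiles
for q in (0.25, 0.5, 0.75, 0.95):
    print(q, round(np.quantile(stat, q), 2), round(stats.chi2.ppf(q, df=n - 1), 2))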
example: Given a sample of data suppose we calculate the sample mean $\bar{x}$ and the sample median $q_{.5}$. Which of these is a better measure of the center of the population?

• If we assume that the data represent a random sample from a probability distribution which is $N(\mu, \sigma^2)$ then it is known that:
  ◦ the sampling distribution of $\bar{X}$ is $N(\mu, \frac{\sigma^2}{n})$;
  ◦ the sampling distribution of the sample median is approximately $N(\mu, \frac{\pi}{2} \cdot \frac{\sigma^2}{n})$.

• Thus the sample mean will, on average, be closer to the population mean than will the sample median, so the sample mean is preferred as an estimate of the population mean.

• If the underlying population is not $N(\mu, \sigma^2)$ then the above result does not hold, and the sample median may be the preferred estimate.

• It follows that the role of assumptions about the underlying probability model is crucial in the development and assessment of statistical procedures.

4.8 Methods of Obtaining Sampling Distributions or Approximations

There are three methods used to obtain information on sampling distributions:

• Exact sampling distributions. Statisticians have, over the last 100 years, developed the sampling distributions of a variety of useful statistics for specific parametric models. For the most part these statistics are simple functions of the sample data such as the sample mean, the sample variance, etc.

• Asymptotic (approximate) distributions. When exact sampling distributions are not tractable we may find the distribution of the statistic for large sample sizes. These are called asymptotic methods and are surprisingly useful.

• Computer intensive methods. These are based on resampling from the empirical distribution of the data and have been shown to have useful properties. The most important of these methods is called the bootstrap.

4.8.1 Exact Sampling Distributions

Here we find the exact sampling distribution of the statistic using the methods previously discussed. The most famous example of this approach is the result that if we have a random sample from a normal distribution then the distribution of the sample mean is also normal. Other examples include the distribution of the sample variance from a normal sample, the t distribution and the F distribution.

4.8.2 Asymptotic Distributions

4.8.3 Central Limit Theorem

If we cannot find the exact sampling distribution of a statistic we may be able to find its mean and variance. If the sampling distribution were approximately normal then we could make approximate statements using just the mean and variance. In the discussion of the binomial and Poisson distributions we noted that for large $n$ those distributions can be approximated by the normal distribution.

• In fact, the sampling distribution of $\bar{X}$ becomes more and more similar to the normal distribution as $n$ increases, regardless of the shape of the original distribution. More precisely:

Central Limit Theorem: If $X_1, X_2, \ldots, X_n$ are independent, each with the same distribution having expected value $\mu$ and variance $\sigma^2$, then the sampling distribution of $\bar{X}$ is approximately $N(\mu, \frac{\sigma^2}{n})$, i.e.
\[
P\left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z \right) \approx P(Z \le z)
\]
where $P(Z \le z)$ is the area under the standard normal curve up to $z$.

The Central Limit Theorem has been extended and refined over the last 75 years.

• Many statistics have sampling distributions which are approximately normal.

• This explains the great use of the normal distribution in statistics.

• In particular, whenever a measurement can be thought of as a sum of individual components we may expect it to be approximately normal.
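Before the blood lead illustration in the next subsection, here is a minimal numerical check of the theorem using a deliberately skewed population. The sketch is in Python with numpy and scipy; the exponential population and the sample size are illustrative, not taken from the text.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# Check the CLT statement for a skewed (exponential) population:
# P((xbar - mu)/(sigma/sqrt(n)) <= z) should be close to P(Z <= z).
mu = sigma = 1.0          # exponential with rate 1 has mean = sd = 1
n, reps = 30, 20_000

xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
standardized = (xbar - mu) / (sigma / np.sqrt(n))

for z in (-1.0, 0.0, 1.0, 2.0):
    print(z, round(np.mean(standardized <= z), 3), round(norm.cdf(z), 3))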
4.8.4 Central Limit Theorem Example

We now illustrate the Central Limit Theorem and some other results on sampling distributions. The data set consists of a population of 1826 children whose blood lead values (micrograms per deciliter) were recorded at the Johns Hopkins Hospital. The data are courtesy of Dr. Janet Serwint. Lead in children is a serious public health problem: lead levels exceeding 15 micrograms per deciliter are considered to have implications for learning disabilities, are implicated in violent behavior, and are the concern of major governmental efforts aimed at reducing exposure. The distribution in real populations is often assumed to follow a log-normal distribution, i.e. the natural logarithm of blood lead values is normally distributed.

Note the asymmetry of the distribution of blood lead values in the histograms below, and that the log transformation results in a decided improvement in symmetry, indicating that the log-normal assumption is probably appropriate.

We select random samples from the population of blood lead readings and log blood lead readings: 100 random samples of size 10, 25 and 100 respectively. As the histograms indicate, the distributions of the sample means of the blood lead values do indeed appear normal, even though the distribution of blood lead values itself is highly skewed.

Figure 4.5: Histograms of blood lead and log blood lead values.

Figure 4.6: Histograms of sample means of blood lead values.

Figure 4.7: Histograms of sample means of log blood lead values.

The summary statistics for the blood lead values and the samples are as follows:

> summary(blpb)
   Min. 1st Qu. Median  Mean 3rd Qu. Max.
      0       5      8 9.773      12  128
> var(blpb)
[1] 71.79325

  Sample Size    Mean   Variance
      10         9.93     9.53
      25         9.75     2.91
     100         9.87      .72

The summary statistics for the log blood lead values and the samples are as follows:

> summary(logblpb)
   Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 -1.386   1.658   2.11 2.084   2.506 4.854
> var(logblpb)
[1] 0.4268104

  Sample Size    Mean   Variance
      10         2.07     .037
      25         2.08     .017
     100         2.08     .004

Note that the variances of the sample means decrease roughly as $\sigma^2/n$ (for the blood lead values, $71.79/25 = 2.87$ and $71.79/100 = 0.72$), as the theory predicts.

4.8.5 Law of Large Numbers

Under quite weak conditions, the average of a sample is "close" to the population average if the sample size is large. More precisely:

Law of Large Numbers: If we have a random sample $X_1, X_2, \ldots, X_n$ from a distribution with expected value $\mu$ and variance $\sigma^2$ then
\[
P(\bar{X} \approx \mu) \approx 1
\]
for $n$ sufficiently large. The approximation becomes closer the larger the value of $n$.

We write $\bar{X} \xrightarrow{p} \mu$ and say that $\bar{X}$ converges in probability to $\mu$. If $g$ is a continuous function and $X$ converges in probability to $\mu$, then $g(X)$ converges in probability to $g(\mu)$.

Some idea of the value of $n$ needed can be obtained from Chebychev's inequality, which states that
\[
P(-k \le \bar{X} - \mu \le k) \ge 1 - \frac{\sigma^2}{n k^2}
\]
where $k$ is any positive constant.

Figure 4.8: Law of Large Numbers examples.
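The law of large numbers and the Chebychev bound can also be illustrated by simulation. A short Python sketch (numpy assumed; the uniform population, with $\mu = 0.5$ and $\sigma^2 = 1/12$, and the value of $k$ are illustrative):

import numpy as np

rng = np.random.default_rng(3)

# Illustrate the law of large numbers and the Chebychev bound
# P(|xbar - mu| <= k) >= 1 - sigma^2/(n k^2) for a uniform(0, 1) population.
mu, sigma2, k = 0.5, 1.0 / 12.0, 0.05

for n in (100, 1_000, 10_000):
    xbar = rng.uniform(size=(5_000, n)).mean(axis=1)
    empirical = np.mean(np.abs(xbar - mu) <= k)
    chebyshev = 1.0 - sigma2 / (n * k * k)
    print(n, round(empirical, 3), round(max(chebyshev, 0.0), 3))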
4.8.6 The Delta Method - Univariate

For statistics $S_n$ which are normal or approximately normal, the delta method can be used to find the approximate distribution of $g(S_n)$, a function of $S_n$. The technique is based on approximating $g$ by a linear function, as in obtaining approximations to expected values and variances of functions, i.e.
\[
g(S_n) \approx g(\mu) + g^{(1)}(\mu)(S_n - \mu)
\]
where $S_n$ converges in probability to $\mu$ and $g^{(1)}(\mu)$ is the derivative of $g$ evaluated at $\mu$. Thus we have
\[
g(S_n) - g(\mu) \approx g^{(1)}(\mu)(S_n - \mu)
\]
If $\sqrt{n}(S_n - \mu)$ has an exact or approximate normal distribution with mean 0 and variance $\sigma^2$, then
\[
\sqrt{n}\,[g(S_n) - g(\mu)]
\]
has an approximate normal distribution with mean 0 and variance $[g^{(1)}(\mu)]^2 \sigma^2$. It follows that we may make approximate calculations by treating $g(S_n)$ as if it were normal with mean $g(\mu)$ and variance $[g^{(1)}(\mu)]^2 \sigma^2 / n$, i.e.
\[
P(g(S_n) \le s) = P\left( \frac{g(S_n) - g(\mu)}{\sqrt{[g^{(1)}(\mu)]^2 \sigma^2 / n}} \le \frac{s - g(\mu)}{\sqrt{[g^{(1)}(\mu)]^2 \sigma^2 / n}} \right) = P\left( Z \le \frac{s - g(\mu)}{\sqrt{[g^{(1)}(\mu)]^2 \sigma^2 / n}} \right)
\]
where $Z$ is $N(0, 1)$. In addition, if $g^{(1)}$ is continuous then we can replace $\mu$ by $S_n$ in the formula for the variance.

example: Let $X$ be binomial with parameters $n$ and $p$ and let $S_n = X/n$. Then we know by the Central Limit Theorem that the approximate distribution of $\sqrt{n}(S_n - p)$ is $N(0, pq)$. If we define
\[
g(x) = \ln\left( \frac{x}{1 - x} \right) = \ln(x) - \ln(1 - x)
\]
then
\[
g^{(1)}(x) = \frac{1}{x} + \frac{1}{1 - x} = \frac{(1 - x) + x}{x(1 - x)} = \frac{1}{x(1 - x)}
\]
Thus
\[
g^{(1)}(\mu) = \frac{1}{pq}
\]
and hence
\[
\sqrt{n} \left[ \ln\left( \frac{S_n}{1 - S_n} \right) - \ln\left( \frac{p}{1 - p} \right) \right]
\]
is approximately normal with mean 0 and variance
\[
\frac{pq}{(pq)^2} = \frac{1}{p} + \frac{1}{q}
\]
Since $g^{(1)}$ is continuous we may treat the sample log odds $\ln\left( \frac{S_n}{1 - S_n} \right)$ as if it were normal with
\[
\text{mean } \ln\left( \frac{p}{1 - p} \right) \quad \text{and variance} \quad \frac{1}{n} \left[ \frac{1}{S_n} + \frac{1}{1 - S_n} \right] = \frac{1}{X} + \frac{1}{n - X}
\]
Thus the distribution of the sample log odds in a binomial may be approximated by a normal distribution with mean equal to the population log odds and variance equal to the sum of the reciprocals of the number of successes and the number of failures.
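This delta-method approximation for the sample log odds is easy to check by simulation. A Python sketch (numpy assumed; $n$ and $p$ are illustrative): the simulated variance of $\ln(X/(n-X))$ should be close to the delta-method value, which at the true $p$ equals $\frac{1}{np} + \frac{1}{nq} = \frac{1}{npq}$.

import numpy as np

rng = np.random.default_rng(4)

# Compare the simulated variance of the sample log odds with the
# delta-method approximation evaluated at the true p.
n, p = 200, 0.3
x = rng.binomial(n, p, size=100_000)
x = x[(x > 0) & (x < n)]                   # log odds undefined at 0 or n

log_odds = np.log(x / (n - x))
delta_var = 1.0 / (n * p) + 1.0 / (n * (1.0 - p))

print(np.round(log_odds.mean(), 3), np.round(np.log(p / (1 - p)), 3))
print(np.round(log_odds.var(), 5), np.round(delta_var, 5))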
4.8.7 The Delta Method - Multivariate

More generally, if we have a collection of statistics $S_1, S_2, \ldots, S_k$ then we say that they are approximately multivariate normally distributed with mean $\boldsymbol{\mu}$ and variance covariance matrix $\mathbf{V}$ if
\[
\sqrt{n}\,\mathbf{a}^T (\mathbf{S}_n - \boldsymbol{\mu})
\]
has an approximate normal distribution with mean 0 and variance $\mathbf{a}^T \mathbf{V} \mathbf{a}$ for any $\mathbf{a}$. In this case the distribution of $g(\mathbf{S}_n)$ is also approximately normal, i.e.
\[
\sqrt{n}\,[g(\mathbf{S}_n) - g(\boldsymbol{\mu})]
\]
is approximately normal with mean 0 and variance
\[
\sigma_g^2 = \nabla(\boldsymbol{\mu})^T \mathbf{V}\, \nabla(\boldsymbol{\mu}) \quad \text{where} \quad \nabla(\boldsymbol{\mu}) = \begin{bmatrix} \frac{\partial g(\boldsymbol{\mu})}{\partial \mu_1} \\ \frac{\partial g(\boldsymbol{\mu})}{\partial \mu_2} \\ \vdots \\ \frac{\partial g(\boldsymbol{\mu})}{\partial \mu_k} \end{bmatrix}
\]
Thus we may make approximate calculations by treating $g(\mathbf{S}_n)$ as if it were normal with mean $g(\boldsymbol{\mu})$ and variance $\sigma_g^2/n$, i.e.
\[
P(g(\mathbf{S}_n) \le s) = P\left( \frac{g(\mathbf{S}_n) - g(\boldsymbol{\mu})}{\sqrt{\sigma_g^2/n}} \le \frac{s - g(\boldsymbol{\mu})}{\sqrt{\sigma_g^2/n}} \right) = P\left( Z \le \frac{s - g(\boldsymbol{\mu})}{\sqrt{\sigma_g^2/n}} \right)
\]
where $Z$ is $N(0, 1)$. In addition, if each partial derivative is continuous we may replace $\boldsymbol{\mu}$ by $\mathbf{S}_n$ in the formula for the variance.

example: Let $X_1$ be binomial with parameters $n$ and $p_1$, let $X_2$ be binomial with parameters $n$ and $p_2$, and let them be independent. Then the joint distribution of
\[
\mathbf{S}_n = \begin{bmatrix} S_{1n} \\ S_{2n} \end{bmatrix} = \begin{bmatrix} X_1/n \\ X_2/n \end{bmatrix}
\]
is such that $\sqrt{n}(\mathbf{S}_n - \mathbf{p})$ is approximately multivariate normal with mean $\mathbf{0}$ and variance covariance matrix
\[
\mathbf{V} = \begin{bmatrix} p_1 q_1 & 0 \\ 0 & p_2 q_2 \end{bmatrix}
\]
Thus if
\[
g(\mathbf{p}) = \ln\left( \frac{p_2}{1 - p_2} \right) - \ln\left( \frac{p_1}{1 - p_1} \right) = \ln(p_2) - \ln(1 - p_2) - \ln(p_1) + \ln(1 - p_1)
\]
we have
\[
\frac{\partial g(\mathbf{p})}{\partial p_1} = -\frac{1}{p_1} - \frac{1}{1 - p_1} = -\frac{1}{p_1 q_1}, \qquad
\frac{\partial g(\mathbf{p})}{\partial p_2} = \frac{1}{p_2} + \frac{1}{1 - p_2} = \frac{1}{p_2 q_2}
\]
It follows that
\[
\sigma_g^2 = \begin{bmatrix} -\frac{1}{p_1 q_1} & \frac{1}{p_2 q_2} \end{bmatrix}
\begin{bmatrix} p_1 q_1 & 0 \\ 0 & p_2 q_2 \end{bmatrix}
\begin{bmatrix} -\frac{1}{p_1 q_1} \\ \frac{1}{p_2 q_2} \end{bmatrix}
= \frac{1}{p_1 q_1} + \frac{1}{p_2 q_2}
\]
Since the partial derivatives are continuous we may treat the sample log odds ratio as if it were normal with mean equal to the population log odds ratio
\[
\ln\left( \frac{p_2/(1 - p_2)}{p_1/(1 - p_1)} \right)
\]
and variance
\[
\frac{1}{X_1} + \frac{1}{n - X_1} + \frac{1}{X_2} + \frac{1}{n - X_2}
\]
If we write the sample data as

  sample 1:  $X_1 = a$,  $n - X_1 = b$
  sample 2:  $X_2 = c$,  $n - X_2 = d$

then the above formula reads
\[
\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}
\]
a very widely used formula in epidemiology.

Technical Notes:

(1) Only a minor modification is needed to show that the result remains true when the sample sizes of the two binomials differ, provided that the ratio of the sample sizes does not tend to 0.

(2) The log odds ratio is much more nearly normally distributed than the odds ratio. We generated 1000 samples of size 20 from each of two binomial populations, one with parameter .3 and the other with parameter .5. The population odds ratio and the population log odds ratio are therefore
\[
\text{odds ratio} = \frac{.5/.5}{.3/.7} = \frac{7}{3} = 2.333; \qquad \text{log odds ratio} = .8473
\]
The asymptotic variance of the log odds ratio, evaluating $1/a + 1/b + 1/c + 1/d$ at the expected counts 6, 14, 10 and 10, is
\[
(1/6) + (1/14) + (1/10) + (1/10) = .4381
\]
which leads to an asymptotic standard deviation of .6618. The mean of the 1000 sample log odds ratios was .9127, with variance .5244 and standard deviation .7241.

Figure 4.9: Graphs of the simulated distributions.

4.8.8 Computer Intensive Methods

• The sampling distributions of statistics which are complicated functions of the observations can sometimes be approximated using the delta method.

• With the advent of fast modern computing, other methods of obtaining sampling distributions have been developed. One of these, called the bootstrap, is of great importance in estimation and in interval estimation.

The Bootstrap Method

Given data $x_1, x_2, \ldots, x_n$, a random sample from $p(x; \theta)$, we estimate $\theta$ by the statistic $\hat{\theta}$. Of interest is the standard error of $\hat{\theta}$. We may not be able to obtain the standard error analytically if $\hat{\theta}$ is a complicated function of the data, nor do we want an asymptotic result which may be suspect when used for small samples.

The bootstrap method, introduced in 1979 by Bradley Efron, is a computer intensive method for obtaining the standard error of $\hat{\theta}$ which has been shown to be valid in most situations. The bootstrap method for estimating the standard error of $\hat{\theta}$ is as follows:

(1) Draw a random sample of size $n$ with replacement from the observed data $x_1, x_2, \ldots, x_n$ and compute $\hat{\theta}$.

(2) Repeat step 1 a large number, $B$, of times, obtaining $B$ separate estimates of $\theta$ denoted by $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_B$.

(3) Calculate the mean of the estimates in step 2, i.e.
\[
\bar{\theta} = \frac{\sum_{i=1}^B \hat{\theta}_i}{B}
\]

(4) The bootstrap estimate of the standard error of $\hat{\theta}$ is given by
\[
\hat{\sigma}_{BS}(\hat{\theta}) = \sqrt{ \frac{\sum_{i=1}^B (\hat{\theta}_i - \bar{\theta})^2}{B - 1} }
\]

The bootstrap is computationally intensive but is easy to use except in very complex problems. Efron suggests that about 250 samples be drawn (i.e. $B = 250$) in order to obtain reliable estimates of the standard error. To obtain percentiles of the bootstrap distribution it is suggested that 500 to 1000 bootstrap samples be taken.
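Steps (1)-(4) translate directly into a few lines of code. The following Python sketch (numpy assumed; the data and the choice of statistic are illustrative, and this is not the Stata bs command used later in this section) computes a bootstrap standard error:

import numpy as np

rng = np.random.default_rng(5)

def bootstrap_se(data, statistic, b=1_000):
    """Bootstrap estimate of the standard error of a statistic,
    following steps (1)-(4) above."""
    data = np.asarray(data)
    estimates = np.array([statistic(rng.choice(data, size=data.size, replace=True))
                          for _ in range(b)])
    return estimates.std(ddof=1)   # sqrt of sum((theta_i - theta_bar)^2) / (B - 1)

# Illustrative data (not from the text): standard error of the median
sample = rng.normal(loc=10.0, scale=2.0, size=50)
print(round(bootstrap_se(sample, np.median), 3))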
The following is a schematic of the bootstrap procedure.

Figure 4.10: Schematic of the bootstrap procedure.

It is interesting to note that the current citation index for statistics lists about 600 papers involving use of the bootstrap!

References:

1. Efron, B. and Gong, G. (1983). A Leisurely Look at the Bootstrap, the Jackknife and Cross-Validation. The American Statistician, Vol. 37, No. 1 (February 1983).

2. Mooney, C. and Duval, R. (1993). Bootstrapping. Sage Publications. This is a very readable introduction designed for applications in the social sciences.

3. The STATA Manual has an excellent section on the bootstrap, and a bootstrap command is available.

The Jackknife Method

The jackknife is another procedure for obtaining estimates and standard errors in situations where

• the exact sampling distribution of the estimate is not known;

• we want an estimate of the standard error which is robust against model failure and against the assumption of large sample sizes.

The jackknife is computer intensive but relatively easy to implement. Assume that we have $n$ observations $x_1, x_2, \ldots, x_n$ which are a random sample from a distribution $p$. Assume the parameter of interest is $\theta$ and that the estimate is $\hat{\theta}$. The jackknife procedure is as follows:

1. Let $\hat{\theta}_{(i)}$ denote the estimate of $\theta$ determined by eliminating the $i$th observation.

2. The jackknife estimate of $\theta$ is defined by
\[
\hat{\theta}_{(JK)} = \frac{1}{n} \sum_{i=1}^n \hat{\theta}_{(i)}
\]
i.e. the average of the $\hat{\theta}_{(i)}$.

3. The jackknife estimate of the standard error of $\hat{\theta}$ is given by
\[
\hat{\sigma}_{JK} = \left[ \frac{n - 1}{n} \sum_{i=1}^n \left( \hat{\theta}_{(i)} - \hat{\theta}_{(JK)} \right)^2 \right]^{1/2}
\]

4.8.9 Bootstrap Example

In ancient Greece a rectangle was called a "Golden Rectangle" if the width to length ratio was
\[
\frac{2}{\sqrt{5} + 1} = 0.618034
\]
This ratio was a design feature of their architecture. The following data set gives the breadth to length ratios of beaded rectangles used by the Shoshani Indians in the decoration of leather goods. Were they also using the Golden Rectangle?

  .693  .748  .654  .670  .662  .672  .615  .606  .690  .628
  .668  .611  .606  .609  .601  .553  .570  .844  .576  .933

We now use the bootstrap method for the sample mean and the sample median.

. infile ratio using "c:\courses\b651201\datasets\shoshani.raw"
(20 observations read)

. stem ratio

Stem-and-leaf plot for ratio
ratio rounded to nearest multiple of .001
plot in units of .001

  5** | 53,70,76
  6** | 01,06,06,09,11,15,28
  6** | 54,62,68,70,72,90,93
  7** | 48
  7** |
  8** | 44
  8** |
  9** | 33

. summarize ratio

    Variable |     Obs        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------
       ratio |      20      .66045   .0924608       .553       .933

. bs "summarize ratio" "r(mean)", reps(1000) saving(mean)

command:    summarize ratio
statistic:  r(mean)
(obs=20)

Bootstrap statistics

Variable |  Reps  Observed      Bias  Std. Err.  [95% Conf. Interval]
---------+-------------------------------------------------------------
     bs1 |  1000    .66045  .0017173   .0197265   .6217399  .6991601  (N)
         |                                         .626775    .70365  (P)
         |                                           .6264     .7021  (BC)
-------------------------------------------------------------------------
N = normal, P = percentile, BC = bias-corrected

. use mean, clear
(bs: summarize ratio)

. kdensity bs1
. kdensity bs1,saving(g1,replace)

. drop _all
. infile ratio using "c:\courses\b651201\datasets\shoshani.raw"
(20 observations read)

. bs "summarize ratio,detail" "r(p50)", reps(1000) saving(median)

command:    summarize ratio,detail
statistic:  r(p50)
(obs=20)

Bootstrap statistics

Variable |  Reps  Observed      Bias  Std. Err.  [95% Conf. Interval]
---------+-------------------------------------------------------------
     bs1 |  1000      .641  -.001711   .0222731   .5972925  .6847075  (N)
         |                                           .6075      .671  (P)
         |                                            .609      .679  (BC)
-------------------------------------------------------------------------
N = normal, P = percentile, BC = bias-corrected

. use median,clear
(bs: summarize ratio,detail)

. kdensity bs1,saving(g2,replace)
. graph using g1 g2

The bootstrap distributions of the sample mean and the sample median are given below.

Figure 4.11: Bootstrap distributions of the sample mean and the sample median.
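The same calculation can be sketched outside Stata. The following Python code (numpy assumed) bootstraps the mean and the median of the Shoshani ratios and also computes the jackknife standard error of the mean described earlier. Because of resampling randomness the bootstrap standard errors will only approximately match the Stata values above (about .020 for the mean and .022 for the median).

import numpy as np

rng = np.random.default_rng(6)

# Shoshani breadth-to-length ratios from the example above
ratio = np.array([.693, .748, .654, .670, .662, .672, .615, .606, .690, .628,
                  .668, .611, .606, .609, .601, .553, .570, .844, .576, .933])

def bootstrap(stat, b=1_000):
    """Return the mean and standard deviation of b bootstrap replicates of stat."""
    reps = np.array([stat(rng.choice(ratio, size=ratio.size, replace=True))
                     for _ in range(b)])
    return reps.mean(), reps.std(ddof=1)

print(ratio.mean(), bootstrap(np.mean))        # observed mean is .66045
print(np.median(ratio), bootstrap(np.median))  # observed median is .641

# Jackknife standard error of the mean, following the jackknife steps above
n = ratio.size
theta_i = np.array([np.delete(ratio, i).mean() for i in range(n)])
se_jk = np.sqrt((n - 1) / n * np.sum((theta_i - theta_i.mean()) ** 2))
print(round(se_jk, 4))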