1 Probability Distributions

In the chapter on descriptive statistics, sample data were discussed and tools were introduced for describing samples with numbers as well as with graphs. In this chapter, models for the population are introduced. We will see how the properties of a population can be described in mathematical terms. Later we will see how samples can be used to draw conclusions about those properties. That step is called statistical inference.

Definition: A random variable (rv) X is a variable whose value is determined by the outcome of a random experiment.

As discussed for variables in samples, rvs can be categorical or numerical, and numerical rvs can be either discrete or continuous:

random variable
  • categorical
  • numerical
      • discrete
      • continuous

In data description we observed that the proper methods depend on the type of the variable. The same holds for rvs: the choice of model depends on their type. The models for continuous rvs differ from those for categorical or discrete rvs. All random variables are described by their distribution.

Definition 1: The distribution of a random variable gives the values the random variable can take and the probabilities with which these values occur.

1.1 Categorical Random Variables

Definition: The distribution of a categorical rv is a table giving all possible values (categories) of the rv and the associated probabilities.

The distribution of a categorical rv can be displayed as a bar graph.

Example: The population investigated consists of the students of a selected college. The random variable of interest is the residence status. It can be either resident or nonresident, so it is categorical. The probability distribution is:

residence status    probability
resident            0.73
nonresident         0.27

If a student is chosen at random from this college, the probability that the student is a resident is 0.73. If X denotes the random variable residence status, we write P(X = resident) = 0.73.
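The interpretation of P(X = resident) = 0.73 can be illustrated with a small simulation. The following is a minimal Python sketch, not part of the original notes: the names `distribution` and `simulate_relative_frequency`, the number of draws, and the fixed seed are all illustrative assumptions.

```python
import random

# Distribution of the categorical rv "residence status" from the example.
distribution = {"resident": 0.73, "nonresident": 0.27}

# The probabilities of a distribution must add up to 1.
assert abs(sum(distribution.values()) - 1.0) < 1e-12


def simulate_relative_frequency(category, n_draws, seed=0):
    """Draw n_draws students at random and return the share that fall in `category`."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    draws = rng.choices(
        list(distribution),                  # possible categories
        weights=list(distribution.values()), # their probabilities
        k=n_draws,
    )
    return draws.count(category) / n_draws


freq = simulate_relative_frequency("resident", 100_000)
print(f"relative frequency of residents: {freq:.3f}")
```

With many draws the printed relative frequency should lie close to 0.73, which is what the probability statement describes.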
1.2 Numerical Random Variables

1.2.1 Discrete Random Variables

Remember: A discrete rv is a random variable whose possible values are isolated points along the number line.

Example 1
1. number of teeth in a patient
2. number of houses in a certain block
3. number of heads when tossing 3 coins

The probability distribution for a discrete rv X can be given as a formula, table, or graph that gives the possible values x of X and their corresponding probabilities P(X = x).

Example: Toss two unbiased coins and let X equal the number of heads observed. The simple events of this experiment are:

coin 1    coin 2    x    probability
H         H         2    1/4
H         T         1    1/4
T         H         1    1/4
T         T         0    1/4

So we get the following distribution for X = number of heads observed:

x    P(X = x)
0    1/4
1    1/2
2    1/4

With the help of this distribution we can calculate that P(X ≤ 1) = P(X = 0) + P(X = 1) = 1/4 + 1/2 = 3/4.

Properties of discrete probability distributions:
• 0 ≤ P(X = x) ≤ 1 for every value x
• \(\sum_x P(X = x) = 1\)

Example 2: Consider the distribution of the variable X = number of vehicles owned per family. Suppose the following table gives the distribution of the variable:

x    P(X = x)
0    0.015
1    0.235
2    0.425
3    0.245
4    ?

What is the value of P(X = 4), if no family owns more than 4 vehicles?
P(X = 4) = 1 − (0.015 + 0.235 + 0.425 + 0.245) = 1 − 0.92 = 0.08, because the probabilities must total 1.

What is the probability that a family has more than 2 vehicles?
P(X > 2) = P(X = 3) + P(X = 4) = 0.245 + 0.08 = 0.325.

The expected value or population mean µ (mu) of a rv X is the value that you would expect to observe on average if the experiment is repeated over and over again. It is the center of the distribution.

Definition: Let X be a discrete rv with probability distribution P(X = x). The population mean µ or expected value of X is given by
\[ \mu = E(X) = \sum_x x \, P(X = x). \]
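The calculations in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a distribution is stored as a dictionary mapping each value x to P(X = x), exact fractions are used to avoid rounding, and the helper names `is_valid_distribution`, `prob` and `expected_value` are made up for this sketch.

```python
from fractions import Fraction

# Distribution of X = number of heads when tossing two unbiased coins.
coin_dist = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def is_valid_distribution(dist):
    """Check 0 <= P(X = x) <= 1 for every x and that the probabilities sum to 1."""
    return all(0 <= p <= 1 for p in dist.values()) and sum(dist.values()) == 1


def prob(dist, event):
    """P(event): add P(X = x) over all values x for which event(x) is true."""
    return sum(p for x, p in dist.items() if event(x))


def expected_value(dist):
    """mu = sum over x of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())


print(is_valid_distribution(coin_dist))    # True
print(prob(coin_dist, lambda x: x <= 1))   # 3/4
print(expected_value(coin_dist))           # 1

# Vehicles per family: the missing probability follows from the total of 1.
known = {0: Fraction("0.015"), 1: Fraction("0.235"),
         2: Fraction("0.425"), 3: Fraction("0.245")}
vehicle_dist = {**known, 4: 1 - sum(known.values())}
print(float(vehicle_dist[4]))                       # 0.08
print(float(prob(vehicle_dist, lambda x: x > 2)))   # 0.325
```

Passing the event as a function keeps one helper usable for P(X ≤ 1), P(X > 2) and similar questions.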
Example: The expected value of the distribution of X = number of heads observed when tossing two coins is
\[ \mu = 0 \cdot \tfrac{1}{4} + 1 \cdot \tfrac{1}{2} + 2 \cdot \tfrac{1}{4} = 1. \]

Example 3: The mean µ of the number of vehicles owned per family is
µ = 0 × 0.015 + 1 × 0.235 + 2 × 0.425 + 3 × 0.245 + 4 × 0.08 = 2.14.

The standard deviation of a distribution measures the spread of the distribution. It is denoted by the Greek letter σ (sigma). Let X be a discrete rv with probability distribution P(X = x). The standard deviation σ of the rv X is
\[ \sigma = \sqrt{\sum_x (x - \mu)^2 \, P(X = x)}. \]

Example (continued): The population standard deviation of X = number of heads observed when tossing two coins is
\[ \sigma = \sqrt{(0 - 1)^2 \cdot \tfrac{1}{4} + (1 - 1)^2 \cdot \tfrac{1}{2} + (2 - 1)^2 \cdot \tfrac{1}{4}} = \sqrt{\tfrac{1}{4} + \tfrac{1}{4}} = \sqrt{\tfrac{1}{2}} = \frac{1}{\sqrt{2}}. \]

The alternative formula for the standard deviation is usually quicker to evaluate:
\[ \sigma = \sqrt{\sum_x x^2 \, P(X = x) - \mu^2}. \]

Example: Donations have been collected. Every person in a population has been asked for a donation. The following table gives the distribution of the donation X:

x           $0      $10     $20     $50
P(X = x)    0.45    0.30    0.20    0.05

Interpretation: If one person is randomly selected from the population, the probability that the person donated $50 is 0.05.

The mean of this distribution is
\[ \mu = \sum_x x \, P(X = x) = 0 \cdot 0.45 + 10 \cdot 0.30 + 20 \cdot 0.20 + 50 \cdot 0.05 = 0 + 3 + 4 + 2.5 = 9.5. \]
On average, a person in this population donated $9.50.

To calculate the standard deviation, first compute σ² (the variance):
\[ \sigma^2 = \sum_x (x - \mu)^2 \, P(X = x) = (0 - 9.5)^2 \cdot 0.45 + (10 - 9.5)^2 \cdot 0.30 + (20 - 9.5)^2 \cdot 0.20 + (50 - 9.5)^2 \cdot 0.05 \]
\[ = 40.6125 + 0.075 + 22.05 + 82.0125 = 144.75, \]
so that
\[ \sigma = \sqrt{\sigma^2} = \sqrt{144.75} \approx 12.03. \]
The standard deviation of this distribution equals $12.03.

Using the alternative formula:
\[ \sigma^2 = \sum_x x^2 \, P(X = x) - \mu^2 = 0^2 \cdot 0.45 + 10^2 \cdot 0.30 + 20^2 \cdot 0.20 + 50^2 \cdot 0.05 - 9.5^2 = 0 + 30 + 80 + 125 - 90.25 = 144.75, \]
and again \(\sigma = \sqrt{144.75} \approx 12.03\).

1.2.2 Continuous Random Variables

Continuous data variables are described by histograms.
For a histogram, the measurement scale is divided into class intervals, and the area of the rectangle placed above each interval is proportional to the relative frequency of the data falling into that interval. The relative frequency can be interpreted as an estimate of the probability of falling into the associated interval. With this interpretation the histogram becomes an "estimate" of the probability distribution of the continuous random variable.

The probability distribution of a continuous rv is described by a smooth curve, called a density curve, which has the following two properties:
1. The total area under the curve is equal to 1.
2. The area under the curve and above any particular interval gives the probability of observing a value of X in that interval when an experimental unit is selected at random from the population.

For the density curve shown in the accompanying figure, for example, the probability of falling in the interval [−2, 0] equals 0.46.

Example: The density of a uniform distribution on the interval [0, 5] is constant. It has height 1/5 = 0.2 on [0, 5] and is 0 elsewhere, so the total area is 5 · 0.2 = 1. Use the density function to calculate probabilities for a random variable X with a uniform distribution on [0, 5]:
• P(X ≤ 3) = area under the curve from −∞ to 3 = 3 · 0.2 = 0.6
• P(1 ≤ X ≤ 2) = area under the curve from 1 to 2 = 1 · 0.2 = 0.2
• P(X > 3.5) = area under the curve from 3.5 to ∞ = 1.5 · 0.2 = 0.3

Remark: Since there is zero area under the curve above a single value, the definition implies for continuous random variables and numbers a and b:
• P(X = a) = 0
• P(X ≤ a) = P(X < a)
• P(X ≥ b) = P(X > b)
• P(a < X < b) = P(a ≤ X ≤ b)
This is generally not true for discrete random variables.

How should a model be chosen for a given variable in a sample? The model (density function) should resemble the histogram of the given variable. Fortunately, many continuous data variables have bell-shaped histograms. The normal probability distribution provides a good model for this type of data.
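The uniform example above reduces to rectangle areas, which can be sketched in Python as follows. The function name `uniform_prob` and its default parameters are illustrative assumptions, not part of the original notes.

```python
def uniform_prob(lo, hi, a=0.0, b=5.0):
    """P(lo <= X <= hi) for X uniform on [a, b]: rectangle area = width * height."""
    height = 1.0 / (b - a)   # density height, 0.2 on [0, 5]
    left = max(lo, a)        # clip the interval to [a, b], because the
    right = min(hi, b)       # density is zero outside of [a, b]
    return max(right - left, 0.0) * height


inf = float("inf")
print(round(uniform_prob(-inf, 3), 4))    # 0.6
print(round(uniform_prob(1, 2), 4))       # 0.2
print(round(uniform_prob(3.5, inf), 4))   # 0.3
print(uniform_prob(2, 2))                 # 0.0
```

The last call shows the remark in action: a single point has zero width and therefore zero probability.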
1.2.3 Normal Probability Distribution

The density function of a normal distribution is unimodal, mound shaped, and symmetric. There are many different normal distributions. They are distinguished from one another by their population mean µ and their population standard deviation σ.

µ is the center of the distribution, located right at the highest point of the density curve. At the values µ − σ and µ + σ the density curve has inflection points: coming from −∞, the curve changes from a left-hand bend to a right-hand bend at µ − σ and back to a left-hand bend at µ + σ.

The function describing the density curve for a given mean µ and a given standard deviation σ is
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}. \]

Example: If the normal distribution is used as a model for a specific situation, the mean and the standard deviation have to be chosen for that situation. For example, the heights of students at a certain university follow a normal distribution with µ = 178 cm and σ = 8 cm.

Once we know the mean and the standard deviation of a normal distribution, we can give intervals that tell us where most of the values of the population are located.

The 68-95-99.7 Rule
In the normal distribution with mean µ and standard deviation σ:
• Approximately 68% of the observations fall within one standard deviation of the mean, within [µ − σ, µ + σ].
• Approximately 95% of the observations fall within two standard deviations of the mean, within [µ − 2σ, µ + 2σ].
• Approximately 99.7% of the observations fall within three standard deviations of the mean, within [µ − 3σ, µ + 3σ].

Example: Continuing the example above, this rule tells us to expect about
• 68% of the heights of students at the university to fall within [178 − 8, 178 + 8] = [170, 186] cm.
• 95% of the heights of students at the university to fall within [178 − 2(8), 178 + 2(8)] = [162, 194] cm.
• 99.7% of the heights of students at the university to fall within [178 − 3(8), 178 + 3(8)] = [154, 202] cm.
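The percentages in the 68-95-99.7 rule can be checked numerically. The sketch below is an addition to the notes and rests on one fact they do not state: the area under a normal curve to the left of x can be written with the error function as 0.5 · (1 + erf((x − µ) / (σ√2))). The helper name `normal_cdf` is illustrative.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal rv, using the error function from the math module."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


mu, sigma = 178.0, 8.0   # student heights in cm, from the example
for k in (1, 2, 3):
    lo, hi = mu - k * sigma, mu + k * sigma
    share = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
    print(f"[{lo:.0f}, {hi:.0f}] cm: {100 * share:.1f}%")
```

The three printed shares come out at roughly 68.3%, 95.4% and 99.7%, which is why the rule is stated with "approximately".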
Definition: The normal distribution with µ = 0 and σ = 1 is called the Standard Normal Distribution.

In order to work with the normal distribution, we need to be able to do the following:
1. We must be able to use the normal distribution to compute probabilities, which are areas under the normal curve.
2. We must be able to describe extreme values in the distribution, such as the largest 5%, the smallest 1%, or the most extreme 10% (which would include the largest 5% and the smallest 5%). That is, we have to be able to calculate percentiles of any normal distribution.

We first look at how to compute these for a Standard Normal Distribution.

Since the normal distribution is a continuous distribution, the following holds for every normally distributed random variable X:
P(X < z) = P(X ≤ z) = area under the curve from −∞ to z.

The area under the curve of a normally distributed random variable is hard to calculate: there is no simple formula for it. Appendix Table A (in the textbook) tabulates, for a standard normal random variable and many different values z*, the area under the curve from −∞ to z*. These are values of the so-called cumulative distribution function.

From now on, Z denotes a standard normal random variable (µ = 0 and σ = 1). Using the table you find that
• P(Z < −1.75) = P(Z ≤ −1.75) = 0.0401,
• P(Z > 1.34) = 1 − P(Z ≤ 1.34) = 1 − 0.9099 = 0.0901 (the shaded area in the corresponding figure equals 0.0901), and
• P(−1 ≤ Z ≤ 1) = P(Z ≤ 1) − P(Z ≤ −1) = 0.8413 − 0.1587 = 0.6826 (the shaded area in the corresponding figure equals 0.6826).

The first probability can be interpreted as meaning that, in a long sequence of observations from a Standard Normal distribution, about 4.01% of the observed values will be smaller than −1.75.

Try this for different values!

Now we will look at how to identify extreme values.
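Before turning to percentiles, the three table values above can be cross-checked by approximating the area under the standard normal density directly. This is a rough Python sketch and not the method of the notes: it uses the trapezoidal rule, replaces −∞ by −10 (an assumption that is harmless because the area further left is negligible), and the names `density`, `area` and `LOWER` are illustrative.

```python
import math


def density(x, mu=0.0, sigma=1.0):
    """Normal density f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def area(a, b, steps=20_000):
    """Approximate the area under the standard normal density on [a, b] (trapezoidal rule)."""
    h = (b - a) / steps
    total = 0.5 * (density(a) + density(b))
    for i in range(1, steps):
        total += density(a + i * h)
    return total * h


LOWER = -10.0   # stands in for -infinity

p_left = area(LOWER, -1.75)          # P(Z < -1.75)
p_right = 1.0 - area(LOWER, 1.34)    # P(Z > 1.34), via the complement
p_middle = area(-1.0, 1.0)           # P(-1 <= Z <= 1)
print(f"{p_left:.4f}  {p_right:.4f}  {p_middle:.4f}")
```

The results agree with Table A up to rounding. The last value comes out as 0.6827 rather than 0.6826, because the table entries 0.8413 and 0.1587 are themselves rounded.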
Definition: For any particular number r between 0 and 1, the rth percentile \(x_r\) of a distribution is a value such that the cumulative area from −∞ to \(x_r\) is equal to r. If X is a random variable, the rth percentile \(x_r\) satisfies the equation
\[ P(X \le x_r) = r. \]
Note that r is stated as a proportion here, so the 0.02th percentile is the value below which 2% of the distribution lies.

To determine the percentiles of a standard normal distribution (denoted by \(z_r\)), we can use Table A again.

• Suppose we want to describe the values that make up the smallest 2%. We are looking for the 0.02th percentile \(z_{0.02}\), with \(P(Z \le z_{0.02}) = 0.02\). Look in the body of Table A for the cumulative area 0.0200. The closest entry is 0.0202, which belongs to z = −2.05. This is the best approximation available from the table. The result is that the smallest 2% of the values of a standard normal distribution fall within the interval (−∞, −2.05].

• Suppose now we are interested in the largest 5%. We are looking for z* with P(Z > z*) = 0.05. Since Table A only gives areas to the left of a given value, the first step is to determine the area to the left of z*:
\[ P(Z \le z^*) = 1 - 0.05 = 0.95. \]
This tells us that \(z^* = z_{0.95}\), the 0.95th percentile. Checking the table we find the areas 0.9495 and 0.9505, with 0.95 exactly in the middle, so we take the average of the corresponding z values and get
\[ z_{0.95} = \frac{1.64 + 1.65}{2} = 1.645. \]

• Finally, suppose we are interested in the most extreme 5%, that is, the values outside the middle 95%. Since the normal distribution is symmetric, the most extreme 5% can be split into the lower 2.5% and the upper 2.5%. Symmetry about 0 implies that \(-z_{0.025} = z_{0.975}\). In Table A we find \(z_{0.025} = -1.96\), so that \(z_{0.975} = 1.96\). The result is that the 5% most extreme values lie outside the interval [−1.96, 1.96].

What remains is to determine such areas and percentiles for any normal distribution, using the results for the standard normal distribution.
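As a cross-check of the three standard normal percentiles found above, Python's standard library can play the role of Table A. This is a minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()   # defaults to mu = 0 and sigma = 1, the standard normal

z_002 = Z.inv_cdf(0.02)     # smallest 2% lie below this value
z_095 = Z.inv_cdf(0.95)     # largest 5% lie above this value
z_0025 = Z.inv_cdf(0.025)   # most extreme 5%: below z_0025 or above -z_0025

print(f"{z_002:.3f}  {z_095:.3f}  {z_0025:.3f}")
```

The values match the table lookups −2.05, 1.645 and −1.96 to the precision the table allows. Small differences in the third decimal are expected, since the table only lists z in steps of 0.01.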
Lemma: If X is normally distributed with population mean µ and population standard deviation σ, then the standardized random variable
\[ Z = \frac{X - \mu}{\sigma} \]
is normally distributed with µ = 0 and σ = 1.

Example: Let X be normally distributed with µ = 100 and σ = 5.

1. Calculate the area under the curve between 98 and 107 for this distribution.
\[ P(98 < X < 107) = P\left(\frac{98 - 100}{5} < \frac{X - 100}{5} < \frac{107 - 100}{5}\right) = P\left(-\frac{2}{5} < Z < \frac{7}{5}\right) = P(-0.4 < Z < 1.4) \]
This can be calculated using Table A:
P(−0.4 < Z < 1.4) = P(Z < 1.4) − P(Z < −0.4) = 0.9192 − 0.3446 = 0.5746.
• The first step is to standardize the rv, so that the result has a standard normal distribution. The Lemma above tells you how: subtract the mean µ and divide by the standard deviation σ.
• In the second step, use the table for the standard normal distribution to find the probability.

2. To find the 0.3th percentile \(x_{0.3}\) of this distribution, use
\[ 0.3 = P(X \le x_{0.3}) = P\left(\frac{X - 100}{5} \le \frac{x_{0.3} - 100}{5}\right) = P\left(Z \le \frac{x_{0.3} - 100}{5}\right). \]
But then \((x_{0.3} - 100)/5\) equals the 0.3th percentile of a standard normal distribution, which we can find in Table A. The entry closest to 0.3000 is 0.3015, which belongs to z = −0.52, so
\[ \frac{x_{0.3} - 100}{5} = -0.52. \]
This is equivalent to \(x_{0.3} = -0.52 \cdot 5 + 100 = 100 - 2.60 = 97.4\). So the lower 30% of a normally distributed random variable with mean µ = 100 and σ = 5 fall into the interval (−∞, 97.4].
• Again, the first step is to standardize the rv, so that the result has a standard normal distribution. This is done by subtracting the mean µ and dividing by the standard deviation σ.
• In the second step, use the table for the standard normal distribution to find the percentile.

Examples for Calculating Probabilities of a Standard Normal distribution

Assumption: The random variable Z has a standard normal distribution, that is, population mean µ = 0 and standard deviation σ = 1.

1.
Calculate P(Z < 0.53): To find the probability use Table A:
(a) Find 0.5 in the left-hand column; this determines the row.
(b) Then find .03 in the top row; this determines the column.
(c) Now read off the value where the row and the column intersect, in this example .7019.
Result: P(Z < 0.53) = 0.7019.

2. Calculate P(Z > −0.79): By the rule for complements, P(Z > −0.79) = 1 − P(Z ≤ −0.79). Now use Table A again:
(a) Find −0.7 in the left-hand column; this determines the row.
(b) Then find .09 in the top row; this determines the column.
(c) Now read off the value where the row and the column intersect, in this example .2148.
Result: P(Z > −0.79) = 1 − P(Z ≤ −0.79) = 1 − 0.2148 = 0.7852.

3. Calculate P(2.1 < Z < 4.79): We have P(2.1 < Z < 4.79) = P(Z ≤ 4.79) − P(Z < 2.1). Use Table A:
(a) 4.79 is larger than the largest value in the table. That means that, to the accuracy of the table, P(Z < 4.79) = 1.
(b) Find 2.1 in the left-hand column; this determines the row.
(c) Then find .00 in the top row; this determines the column.
(d) Now read off the value where the row and the column intersect, in this example .9821.
Result: P(2.1 < Z < 4.79) = P(Z ≤ 4.79) − P(Z < 2.1) = 1 − 0.9821 = 0.0179.

Examples for Calculating Probabilities of a Normal distribution (not necessarily standard)

Assumption: The random variable X is normally distributed with mean µ = 80 and standard deviation σ = 10.

1. Calculate P(X < 100): First standardize:
\[ P(X < 100) = P\left(\frac{X - \mu}{\sigma} < \frac{100 - \mu}{\sigma}\right) = P\left(Z < \frac{100 - 80}{10}\right) = P(Z < 2) \]
Now look this up in Table A, applying the method from above, and find P(Z < 2) = 0.9772.

2. Calculate P(X > 79): By the rule for complements, P(X > 79) = 1 − P(X ≤ 79).
First standardize:
\[ P(X \le 79) = P\left(\frac{X - \mu}{\sigma} \le \frac{79 - \mu}{\sigma}\right) = P\left(Z \le \frac{79 - 80}{10}\right) = P(Z \le -0.1) \]
Now use Table A again:
(a) Find −0.1 in the left-hand column; this determines the row.
(b) Then find .00 in the top row; this determines the column.
(c) Now read off the value where the row and the column intersect, in this example .4602.
Result: P(X > 79) = 1 − P(X ≤ 79) = 1 − 0.4602 = 0.5398.

3. Calculate P(70 < X < 80): We have P(70 < X < 80) = P(X ≤ 80) − P(X < 70). First standardize:
\[ P(X \le 80) - P(X < 70) = P\left(\frac{X - \mu}{\sigma} \le \frac{80 - \mu}{\sigma}\right) - P\left(\frac{X - \mu}{\sigma} < \frac{70 - \mu}{\sigma}\right) = P(Z \le 0) - P(Z < -1) = 0.5 - 0.1587 = 0.3413 \]

Examples for Calculating Percentiles of a Standard Normal distribution

Assumption: The random variable Z has a standard normal distribution, that is, population mean µ = 0 and standard deviation σ = 1.

1. Calculate the 0.9th percentile \(z_{0.9}\), defined by \(P(Z < z_{0.9}) = 0.9\): To find the percentile use Table A:
(a) Find 0.9 in the body of the table; the closest entry is 0.8997.
(b) Go to the left and find 1.2 in the left-hand column.
(c) Go to the top and find .08 in the top row.
Result: The 0.9th percentile equals \(z_{0.9} = 1.28\).

2. Find the interval which contains the middle 50%: The middle 50% of the standard normal distribution lies between the 0.25th percentile and the 0.75th percentile. Now use Table A again.
For the 0.25th percentile:
(a) Find 0.25 in the body of the table; the closest entry is 0.2514.
(b) Go to the left and find −0.6 in the left-hand column.
(c) Go to the top and find .07 in the top row.
For the 0.75th percentile:
(a) Find 0.75 in the body of the table; the closest entry is 0.7486.
(b) Go to the left and find 0.6 in the left-hand column.
(c) Go to the top and find .07 in the top row.
Result: The middle 50% of a standard normal distribution lies between −0.67 and 0.67.

Examples for Calculating Percentiles of ANY Normal distribution

Assumption: The random variable X is normally distributed with mean µ = 50 and standard deviation σ = 20.

1.
Calculate the 0.1th percentile \(x_{0.1}\), defined by \(P(X < x_{0.1}) = 0.1\): First standardize:
\[ P(X < x_{0.1}) = P\left(\frac{X - \mu}{\sigma} < \frac{x_{0.1} - \mu}{\sigma}\right) = P\left(Z < \frac{x_{0.1} - 50}{20}\right) = 0.1 \]
For this equation to hold, \((x_{0.1} - 50)/20\) has to be the 0.1th percentile of the standard normal distribution. So
\[ z_{0.1} = \frac{x_{0.1} - 50}{20} \iff x_{0.1} = 50 + 20 \, z_{0.1}. \]
To find the percentile \(z_{0.1}\) use Table A:
(a) Find 0.1 in the body of the table; the closest entry is 0.1003.
(b) Go to the left and find −1.2 in the left-hand column.
(c) Go to the top and find .08 in the top row.
Result: The 0.1th percentile of the standard normal distribution equals \(z_{0.1} = -1.28\), so that \(x_{0.1} = 50 + 20(-1.28) = 50 - 25.6 = 24.4\).

2. Find the interval which contains the middle 50% of this distribution: The middle 50% of the normal distribution lies between the 0.25th percentile and the 0.75th percentile. First standardize:
\[ P(X < x_{0.25}) = P\left(\frac{X - \mu}{\sigma} < \frac{x_{0.25} - \mu}{\sigma}\right) = P\left(Z < \frac{x_{0.25} - 50}{20}\right) = 0.25 \]
and
\[ P(X < x_{0.75}) = P\left(\frac{X - \mu}{\sigma} < \frac{x_{0.75} - \mu}{\sigma}\right) = P\left(Z < \frac{x_{0.75} - 50}{20}\right) = 0.75. \]
For these equations to hold, \((x_{0.25} - 50)/20\) has to be the 0.25th percentile of the standard normal distribution and \((x_{0.75} - 50)/20\) has to be the 0.75th percentile of the standard normal distribution. So
\[ z_{0.25} = \frac{x_{0.25} - 50}{20} \iff x_{0.25} = 50 + 20 \, z_{0.25} = 50 + 20(-0.67) = 36.6 \]
and
\[ z_{0.75} = \frac{x_{0.75} - 50}{20} \iff x_{0.75} = 50 + 20 \, z_{0.75} = 50 + 20(0.67) = 63.4. \]
Result: The middle 50% of this normal distribution lies between 36.6 and 63.4.
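The two-step recipe used throughout these examples (standardize, then consult the standard normal distribution) can be written down compactly. The following Python sketch is an illustration, not part of the original notes: it assumes Python 3.8 or later for `statistics.NormalDist`, which takes the place of Table A, and the function names `normal_prob` and `normal_percentile` are made up.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # mu = 0, sigma = 1; plays the role of Table A


def normal_prob(a, b, mu, sigma):
    """P(a < X < b): standardize both endpoints, then use the standard normal cdf."""
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    return STANDARD_NORMAL.cdf(z_b) - STANDARD_NORMAL.cdf(z_a)


def normal_percentile(r, mu, sigma):
    """x_r = mu + sigma * z_r, where z_r is the standard normal rth percentile."""
    return mu + sigma * STANDARD_NORMAL.inv_cdf(r)


# Probabilities: the examples with mu = 100, sigma = 5 and mu = 80, sigma = 10.
print(f"{normal_prob(98, 107, mu=100, sigma=5):.4f}")
print(f"{normal_prob(70, 80, mu=80, sigma=10):.4f}")

# Percentiles: the examples with mu = 50, sigma = 20.
print(f"{normal_percentile(0.1, mu=50, sigma=20):.1f}")
print(f"{normal_percentile(0.25, mu=50, sigma=20):.1f}  "
      f"{normal_percentile(0.75, mu=50, sigma=20):.1f}")
```

The output agrees with the table-based results up to table rounding. For instance, the middle 50% comes out as roughly 36.5 to 63.5 instead of 36.6 to 63.4, because the table forces z to be rounded to −0.67 and 0.67.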