Probability Plots
A simple way to assess the fit of a particular probability distribution to a data set is to superimpose the
p.d.f. of the distribution on a relative frequency histogram of the data.
A better method uses a graph which plots quantiles of the proposed distribution against the
corresponding quantiles of the data set.
Defn: The pth quantile of a data set is the smallest number such that the fraction of the data values less
than that number is p.
Defn: The pth quantile of the distribution of a continuous r.v. X is the smallest number x such that F(x)
= p.
Defn: For a random sample of size n, consisting of observed values x1, x2, …, xn, the ith order statistic
is the ith data value when the data values are ordered from smallest to largest.
i  0.5
The cumulative relative frequency associated with the ith order statistic is
.
n
The general procedure for constructing a probability plot (or a quantile-quantile plot) is as follows:
1) Sort the data in ascending order.
2) For the sample size n, calculate the cumulative relative frequencies.
3) Invert the assumed distribution function to find the quantiles associated with the cumulative
relative frequencies.
4) Do a scatterplot of the order statistics of the data v. the quantiles of the distribution.
Constructing a normal probability plot
If we have a set of data consisting of observed values x1, x2, …, xn, and we want to decide whether it is
reasonable to assume that the data were sampled from a normal distribution, we proceed as follows:
1) Sort the data from smallest to largest, yielding the order statistics x(1), x(2), …, x(n).
2) Calculate the standardized normal scores z(i) = Φ⁻¹((i − 0.5)/n), for each i = 1, 2, …, n, using Table 1 in Appendix A, where Φ⁻¹ is the inverse standard normal c.d.f.
3) Plot the order statistics of the data set against the corresponding standardized normal scores on regular graph paper.
If the plotted points lie near a straight line, then it is reasonable to assume that the data were sampled
from a normal distribution.
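The steps above can be sketched in a few lines of Python using only the standard library; the data values here are made up for illustration.

```python
# Minimal sketch of a normal probability plot calculation (hypothetical data).
from statistics import NormalDist

data = [3.1, 4.7, 2.2, 5.0, 3.8, 4.1, 2.9, 4.4]   # hypothetical sample

x_sorted = sorted(data)                            # order statistics x(1), ..., x(n)
n = len(x_sorted)
probs = [(i - 0.5) / n for i in range(1, n + 1)]   # cumulative relative frequencies
z = [NormalDist().inv_cdf(p) for p in probs]       # standardized normal scores

# Plotting x_sorted against z (on graph paper or with any plotting tool)
# and checking for linearity completes the probability plot.
for zi, xi in zip(z, x_sorted):
    print(f"z = {zi:6.3f}   x = {xi:4.1f}")
```

`NormalDist().inv_cdf` plays the role of Table 1 in Appendix A: it returns the standard normal quantile for a given cumulative probability.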
Example: p. 79, 3-63
A) Exponential Distribution
If X is a continuous r.v. which has an exponential distribution with mean μ, then the c.d.f. of the distribution is F(x) = 1 − exp(−x/μ), for x > 0, and F(x) = 0, for x ≤ 0.
Then the pth quantile of the distribution (for 0 < p < 1) is given by
p = 1 − exp(−x/μ), implying x = −μ ln(1 − p).
The calculated quantiles for the distribution would be found from
Q(i) = −x̄ ln(1 − (i − 0.5)/n), for i = 1, 2, …, n.
Example: The lifetimes (in years) of 10 cell phones are given as 4.23, 1.89, 10.52, 6.46, 8.32, 8.60, 0.41, 0.91, 2.66, 35.71. Is it reasonable to assume that X, the lifetime of a randomly selected cell phone, has an exponential distribution? The order statistics and quantiles are given in the table below; the sample mean is 7.971 years:
 i    x(i)    (i − 0.5)/n    Q(i)
 1     0.41       0.05        0.41
 2     0.91       0.15        1.30
 3     1.89       0.25        2.29
 4     2.66       0.35        3.43
 5     4.23       0.45        4.76
 6     6.46       0.55        6.36
 7     8.32       0.65        8.37
 8     8.60       0.75       11.05
 9    10.52       0.85       15.12
10    35.71       0.95       23.88
We do a scatterplot of the order statistics v. the quantiles of the distribution:
[Figure: Exponential Probability Plot for Cell Phone Data; ordered lifetimes (years) v. exponential quantiles]
Note that the one large observation lies far from the line.
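The exponential quantiles in the table can be reproduced with a short standard-library sketch; the sample mean of 7.971 years is taken from the example and used as the estimate of μ.

```python
# Sketch: exponential quantiles Q(i) = -xbar * ln(1 - (i - 0.5)/n)
# for the cell phone lifetime example (n = 10, xbar = 7.971 years).
import math

xbar = 7.971   # sample mean lifetime in years (estimate of the mean mu)
n = 10

Q = [-xbar * math.log(1 - (i - 0.5) / n) for i in range(1, n + 1)]
print([round(q, 2) for q in Q])   # first value 0.41, last value 23.88
```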
B) Lognormal Distribution
To construct a q-q plot for a data set believed to be sampled from a lognormal distribution, we recall
that a continuous r.v. X has a lognormal distribution if W = ln(X) has a normal distribution. Hence we
can simply take the logs of the data values and do a normal probability plot to assess the fit of the data.
Example: We found that the cell phone lifetime data did not quite fit an exponential distribution.
Perhaps we could try a lognormal distribution instead. The following table has the order statistics for
the logs of the data values, and the quantiles of a standard normal distribution.
 i    w(i)    (i − 0.5)/n    Q(i)
 1   -0.89       0.05       -1.64
 2   -0.09       0.15       -1.04
 3    0.64       0.25       -0.67
 4    0.98       0.35       -0.39
 5    1.44       0.45       -0.13
 6    1.87       0.55        0.13
 7    2.12       0.65        0.39
 8    2.15       0.75        0.67
 9    2.35       0.85        1.04
10    3.58       0.95        1.64
The scatterplot of the order statistics v. the quantiles of the standard normal distribution is shown
below. Since the data points all fall near a straight line, we may say that it is reasonable to assume that
the data were sampled from a lognormal distribution, or that the r.v. X (= lifetime of a randomly
selected cell phone) has a lognormal distribution.
[Figure: Lognormal Q-Q Plot of Cell Phone Data; log of ordered lifetime v. quantile of standard normal distribution]
C) Weibull Distribution
The c.d.f. for a Weibull(β, δ) distribution (shape β, scale δ) is
F(x) = 1 − exp[−(x/δ)^β], for x > 0, and F(x) = 0, for x ≤ 0. Then the pth quantile for the distribution is given by
p = 1 − exp[−(Q(p)/δ)^β], giving β ln Q(p) − β ln δ = ln[−ln(1 − p)].
Of course, we don’t know the values of the parameters, but it doesn’t matter. There should be a straight-line relationship between ln Q(p) and ln[−ln(1 − p)]. Hence if the data come from a Weibull distribution, then the graph of the log of the order statistics v. ln[−ln(1 − p)] should be a straight line.
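The plotting positions ln[−ln(1 − p)] depend only on the sample size, so they can be computed once per n; a sketch for the n = 14 case used in the example that follows:

```python
# Sketch: Weibull plotting positions ln(-ln(1 - p)) with p = (i - 0.5)/n.
import math

n = 14
p = [(i - 0.5) / n for i in range(1, n + 1)]
horiz = [math.log(-math.log(1 - pi)) for pi in p]
print([round(h, 2) for h in horiz])   # runs from -3.31 up to 1.20
```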
Example:
14 randomly selected identical coupons of steel were stress tested, and their breaking strengths were recorded (the order statistics are given in the table below). Also given in the table are the values of ln[−ln(1 − p)], with p = (i − 0.5)/14.
 i    ln[x(i)]    (i − 0.5)/14    ln[−ln(1 − (i − 0.5)/14)]
 1      3.53          0.036              -3.31
 2      3.97          0.107              -2.18
 3      4.43          0.179              -1.63
 4      4.67          0.250              -1.25
 5      4.92          0.321              -0.95
 6      5.05          0.393              -0.70
 7      5.17          0.464              -0.47
 8      5.33          0.536              -0.26
 9      5.42          0.607              -0.07
10      5.54          0.679               0.13
11      5.62          0.750               0.33
12      5.69          0.821               0.54
13      5.80          0.893               0.80
14      5.83          0.964               1.20
The scatterplot of the second column in the table v. the last column is shown on the next page. All of
the points lie near a straight line; hence it is reasonable to assume that the random variable X
(= breaking stress for randomly selected steel coupon) has a Weibull distribution.
[Figure: Weibull Q-Q Plot for Steel Breaking Stress; log of order statistic v. ln(−ln(1 − p))]
Distributions of Discrete Random Variables
For a discrete r.v. X, there is a finite or at most countably infinite number of possible values.
Defn: The probability distribution for a discrete r.v. X is a set of ordered pairs of numbers. In each
pair, the first number is a possible value x of X, and the second number is the probability that the r.v. X
assumes the value x.
Example: Our random experiment is to flip a fair coin twice. The r.v. X is defined to be the number of
heads that occur. The sample space of the experiment is S = { HH, HT, TH, TT }. Since the coin is
fair, each of these outcomes occurs with probability 0.25. The distribution of the r.v. X is given in the
table below:
x          0      1      2
P(X = x)   0.25   0.50   0.25
Defn: For a discrete r.v. X with possible values x1, x2, …, xn, the probability mass function (or p.m.f.) is f(xi) = P(X = xi), for i = 1, 2, …, n. The p.m.f. must satisfy two conditions which follow from Kolmogorov’s axioms:
1) f(x) ≥ 0 for all x, and
2) Σ f(xi) = 1, summing over i = 1, 2, …, n.
Mean and Variance of a Discrete Distribution
Defn: The mean, or expectation, or expected value, of a discrete r.v. X is given by μ = Σ xi f(xi), summing over i = 1, 2, …, n.
Defn: The variance of a discrete r.v. X is given by σ² = Σ (xi − μ)² f(xi). The standard deviation of X is just the square root of the variance.
Note: It is generally easier to calculate the variance using the equivalent formula σ² = Σ xi² f(xi) − μ².
Bernoulli Distribution
The simplest type of discrete distribution is one for which the r.v. has two possible values.
Defn: A discrete r.v. X is said to have a Bernoulli distribution with parameter p (X ~ Bernoulli(p)) if
there are exactly two possible values 0 and 1 of X, such that P(X = 1) = p, and P(X = 0) = 1 – p.
The mean of a Bernoulli distribution is given by
μ = Σ xi f(xi) = (1)(p) + (0)(1 − p) = p.
The variance of a Bernoulli distribution is given by
σ² = Σ xi² f(xi) − μ² = (1²)(p) + (0²)(1 − p) − p² = p(1 − p).
Example: Our random experiment is to flip a fair coin once. We define X = number of heads. Then X ~ Bernoulli(0.5). Then P(X = 1) = 0.5, and P(X = 0) = 0.5. The mean of this distribution is μ = 0.5. The standard deviation is σ = 0.5.
Binomial Distribution
Assume that instead of flipping the fair coin once, we flip it 10 times, and we define our discrete r.v. X
to be the number of heads. We want to be able to calculate probabilities associated with X, as well as
the mean and standard deviation.
Assume that we have n independent, identically distributed (i.i.d.) r.v.’s Y1, Y2, …, Yn, each of which
has a Bernoulli distribution.
Let X = Y1 + Y2 + … + Yn. Then X is said to have a binomial distribution.
Defn: A discrete r.v. X is said to have a binomial distribution with parameters n and p if
P(X = x) = C(n, x) p^x (1 − p)^(n − x), for x = 0, 1, …, n, where C(n, x) denotes the binomial coefficient.
Derivation of the Binomial Distribution:
A binomial experiment is a random experiment which satisfies the following conditions:
1) The experiment consists of a fixed number, n, of trials.
2) The trials are identical to each other, in that they are performed the same way.
3) The trials are independent of each other, meaning that the outcome of one trial does not give any
information about the outcome of any other trial.
4) Each trial has two possible outcomes, which we will call Success and Failure.
5) The probability of Success is the same, p, for each of the trials.
We let X = # of Successes in the n trials. The possible values of X are 0, 1, 2, 3, …, n. For a given x in {0, 1, 2, …, n}, what is P(X = x)?
One way that we can have exactly x Successes out of n trials is for the first x trials to result in Success and the remaining n − x trials to result in Failure. If the trials are independent (the outcome of one trial is unrelated to the outcome of any other trial), then
P(x Successes followed by n − x Failures) = p^x (1 − p)^(n − x).
Any other ordering of x Successes and n − x Failures will have the same probability of occurring. How many such orderings are there?
Defn: Given a set of n objects, the number of ways to choose a subset of x of the objects is given by the binomial coefficient:
C(n, x) = n!/(x!(n − x)!).
The number of different orderings of x Successes and n – x Failures is the same as the number of ways
of choosing x of the n trials to be Successes.
Hence, the probability that there will be exactly x Successes in n Bernoulli trials is given by:
P(X = x) = C(n, x) p^x (1 − p)^(n − x), for x = 0, 1, …, n.
The mean and variance of a binomial r.v. X may be found as follows: Since X is the sum of n independent, identical Bernoulli r.v.’s, the mean of X is just n times the mean of the Bernoulli distribution, i.e., μ = np. The variance of X is also just n times the variance of the Bernoulli distribution, i.e., σ² = np(1 − p).
Example: Let’s go back to our random experiment of flipping a fair coin 10 times. Let X = number of
heads that occur. Does this satisfy the conditions of being a binomial experiment?
The mean is μ = np = (10)(0.5) = 5. The variance is σ² = np(1 − p) = (10)(0.5)(0.5) = 2.5, and the standard deviation is σ = 1.5811.
We have P(X = 5) = C(10, 5)(0.5)^5(0.5)^5 = 0.24609375.
What about P(X ≤ 5)? P(X ≤ 5) = Σ C(10, x)(0.5)^x(0.5)^(10 − x), summing over x = 0, 1, …, 5, which equals 0.6230. Clearly the calculations can become tedious.
To find binomial probabilities using Excel: If X ~ binomial(n, p), and we want to find P(X ≤ x), then in the cell of the worksheet, enter
=BINOMDIST(x, n, p, TRUE).
In our example, we want to find P(X ≤ 5). In cell A1, we enter
=BINOMDIST(5,10,0.5,TRUE)
We get 0.6230.
If we want P(X = 5), we enter
=BINOMDIST(5,10,0.5,TRUE) – BINOMDIST(4,10,0.5,TRUE) We get 0.2461.
If we want P(X > 5), we enter
=1-BINOMDIST(5,10,0.5,TRUE)
We get 0.3770.
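The same three probabilities can be computed directly in Python with math.comb, as an alternative to the Excel calls; a minimal sketch:

```python
# Sketch: binomial probabilities for X ~ binomial(10, 0.5),
# mirroring the Excel BINOMDIST calculations.
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ binomial(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """P(X <= x), the analogue of BINOMDIST(x, n, p, TRUE)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

print(f"P(X <= 5) = {binom_cdf(5, 10, 0.5):.4f}")      # 0.6230
print(f"P(X = 5)  = {binom_pmf(5, 10, 0.5):.4f}")      # 0.2461
print(f"P(X > 5)  = {1 - binom_cdf(5, 10, 0.5):.4f}")  # 0.3770
```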
Examples of Binomial Experiments:
1) Assume that the date is October 1, 2008. We want to predict the outcome of the Presidential election. We will assume, for simplicity, that there are only two candidates, Joe Asinus and Judy Olifaunt. We select a simple random sample of 1000 voters, and ask each voter in the sample, “Do you intend to vote for Asinus or Olifaunt?” Let X = number of voters in the sample who plan to vote for Asinus.
2) A worn machine tool produces 1% defective parts. We select a simple random sample of 25 parts produced by this machine, and let X = number of defective parts in the sample.
3) I give a pop quiz to the class consisting of 10 multiple choice questions, each with four possible responses, only one of which is the correct response. A student has been goofing off all semester, and comes to class totally unprepared for the quiz. He decides to randomly guess the answer to each question. Let X = his score on the quiz.
4) It is known that of the entire population of adults in Florida, 5% have a certain blood type. We select a random sample of Florida adults and obtain blood samples to test. Let X = number of people in the sample who have the blood type.
5) On p. 88, there is a graph illustrating the random experiment of flipping a fair coin 20 times, counting the number of heads.
Poisson Distribution
This distribution provides a model for the occurrence of rare events over an interval of time, distance, or some other dimension.
Examples:
1) X = number of cars driving through an intersection in an hour.
2) X = number of accidents occurring at an intersection in a year.
3) X = number of alpha particles emitted by a sample of U-238 over a period of time.
The common characteristics of Poisson processes are these: We divide the interval of time (distance, etc.) into a large number of equal subintervals. Then:
1) The probability of more than one count occurring in a small subinterval is negligible (approximately 0);
2) The probability of exactly one count in a small subinterval is the same for all equal subintervals, and is proportional to the length of the subinterval;
3) The count in each small subinterval is independent of the counts in the other subintervals.
We let X = count of occurrences in the entire interval.
Defn: A discrete r.v. X is said to have a Poisson distribution with mean λ if the p.m.f. of the distribution is
f(x) = e^(−λ) λ^x / x!, for x = 0, 1, 2, 3, ….
The mean and variance of the distribution are
μ = E(X) = λ and V(X) = λ.
Note: We may derive the Poisson distribution as a limiting case of the binomial distribution with the
number of trials going to infinity and the probability of success on each trial going to 0 in such a way
that the mean of the distribution remains constant.
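The limiting-case note can be illustrated numerically: for a large n and p = λ/n, binomial probabilities are close to the corresponding Poisson probabilities. A minimal sketch (the choice λ = 2 and n = 10000 is arbitrary):

```python
# Sketch: Poisson p.m.f., plus a numerical check that binomial(n, lam/n)
# probabilities approach Poisson(lam) probabilities as n grows.
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 2.0
n = 10_000
p = lam / n
binom_p2 = math.comb(n, 2) * p ** 2 * (1 - p) ** (n - 2)   # binomial P(X = 2)

print(f"Poisson P(X = 2)  = {poisson_pmf(2, lam):.4f}")
print(f"Binomial P(X = 2) = {binom_p2:.4f}")
```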
Example: p. 95, Exercise 3-95