FRESHMAN SEMINAR
Professor Richard Wilson
Fall Term 2004

Notes on Statistics

You do not need to know much statistics to understand the matters in this course, but it is important to have a very clear understanding of the following two fundamental concepts:

  Randomness
  Statistical independence

You might want to read the introductory chapter of any statistics textbook. Often recommended is Probability and Statistics for Engineers and Scientists, by Walpole and Myers (Macmillan). See also the Xerox notes, Yardley Beers: "Theory of Error". I will merely put a few notes here on what you may need during the term. (We will not need them all at once.) We also need to recognize the notation and jargon that statisticians use among themselves. But please, statisticians: when talking to others, try to avoid the jargon.

Seminar example: In order to give a feeling for how one can combine random quantities to get a (fairly) definite result, we will start with a simple seminar exercise on random numbers. Each member of the seminar will get a (different) sheet of 100 digits chosen at random from the numbers 0 to 9. You will be asked to generate the means of rows, columns, and the mean of the total, and the standard deviation thereof. Then we will compare everyone's mean in class, and see whether the class mean is within the expected range.

This brings up at once the concept of independence. The fact that the class mean comes out close to the expected value of 4.5 depends upon the fact that every student has a list of random numbers that is independent of every other student's list. If everyone had the same list, the class mean could well be further from 4.5.

Two years ago I made the mistake of getting the random numbers from the standard EXCEL program. When we went through this exercise we found that they were not random! (Although they appeared to be so superficially.) Anyone who has the enthusiasm can go to the EXCEL program and prove this for him/herself.
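As an illustration (not part of the assignment), the seminar exercise can be simulated in a few lines of Python. The sheet size of 100 digits matches the handout; the class size of 25 is my own illustrative assumption.

```python
# Sketch of the seminar exercise: sheets of random digits 0-9.
import random
import statistics

def random_sheet(n=100):
    """One student's sheet: n digits drawn uniformly from 0 to 9."""
    return [random.randrange(10) for _ in range(n)]

random.seed(42)  # fixed seed for a reproducible illustration

sheet = random_sheet()
sheet_mean = statistics.mean(sheet)
sheet_sd = statistics.stdev(sheet)

# A "class" of 25 students, each with an independent sheet:
class_means = [statistics.mean(random_sheet()) for _ in range(25)]
class_mean = statistics.mean(class_means)

print(f"one sheet:  mean = {sheet_mean:.2f}, sd = {sheet_sd:.2f}")
print(f"class mean = {class_mean:.3f}  (expected value 4.5)")
```

A single sheet mean scatters about 4.5 with a standard deviation of roughly 0.29; the class mean, being an average of independent sheets, sits much closer to 4.5, which is the point of the exercise.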
I now have an ADD IN for EXCEL which does a better job.

Calculations of probability

Numerical estimates of probability are derived from the number of different ways of choosing the group concerned from the total sample. For example: a poker hand has 5 cards; what is the probability of getting 3 jacks and 2 aces?

We address this by first addressing the partial question: what is the number of ways of getting two out of the four aces? The total number of ways of arranging four aces is four factorial, written as 4!. But the number of ways of arranging the two aces picked is 2!, and the number of ways of arranging the other two aces is also 2!. Then the number of ways of choosing 2 aces out of 4 aces is given by the binomial coefficient notation:

  C(4,2) = 4! / (2! 2!) = 6

Likewise, the number of ways of getting 3 jacks out of 4 is

  C(4,3) = 4! / (3! 1!) = 4

Therefore, there are 6 x 4 = 24 ways of having 2 aces and 3 jacks. The total number of 5-card poker hands, each one of which is equally likely, is

  N = C(52,5) = 52! / (5! 47!) = 2,598,960

Therefore the probability of getting 2 aces and 3 jacks is

  24 / 2,598,960 ≈ 0.9 x 10^-5

Algebraic presentation

We sometimes express the results in a special algebraic notation or a geometric diagram (a Venn diagram). An experiment (pulling candies out of a box) can lead to event A (mint candies, for example) or event B (toffees). We use the notation:

  P(A)      the probability of event A
  P(B)      the probability of event B
  P(A∪B)    the probability of the union of the two events, called "A or B"
  P(A∩B)    the probability of the intersection of the two events, called "both A and B" (often abbreviated P(AB))

Geometric presentation

Sometimes the overlap of probabilities is described in a Venn diagram. I won't draw one here; transfer of files makes a mess of them.

Additivity:

  P(A∪B) = P(A) + P(B) − P(A∩B)
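As a quick check of the poker-hand arithmetic earlier in this section, the binomial coefficients can be computed with Python's built-in math.comb:

```python
# Check of the 2-aces-and-3-jacks calculation using math.comb.
import math

ways = math.comb(4, 2) * math.comb(4, 3)  # 6 * 4 = 24 favourable hands
hands = math.comb(52, 5)                  # 2,598,960 equally likely hands
p = ways / hands

print(ways, hands)   # 24 2598960
print(f"{p:.1e}")    # about 0.9e-5
```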
If the probability that a polluting industry locates in Munich is 0.7, the probability that it locates in Brussels is 0.4, and the probability that it will be in either area or both is 0.8, what is the probability that it is in both? In neither?

If A and B are mutually exclusive, there is no overlap in the Venn diagram and P(A∩B) = 0. (Note that this is exclusivity, not independence: for independent events P(A∩B) = P(A)P(B).)

Conditional probability

We often want to discuss a conditional probability: the probability of an event B given that the event A occurred,

  P(B|A) = P(A∩B) / P(A)     (P(A) ≠ 0)

e.g.:

  Probability that the US Air Shuttle departs on time: P(D) = 0.83
  Probability that the US Air Shuttle arrives on time: P(A) = 0.72

We get these from statistical records at each end. The probability of both an on-time departure and an on-time arrival is P(A∩D) = 0.78, so

  P(A|D) = P(A∩D) / P(D) = 0.78 / 0.83 = 0.94

In this case the event A is not independent of the event D. If it were, we would have P(A|D) = P(A).

For example: the probability of a person getting leukemia from working 10 years at an average benzene level of 1 ppm in the workplace is 0.1%. The probability of anyone getting leukemia in his lifetime is 0.7%. If you know that a person died of leukemia and worked 10 years at 1 ppm of benzene in the workplace, what is the likelihood that his leukemia was caused by benzene?

  P(B|A) = 0.1% / 0.7% ≈ 0.14

This is sometimes called the Probability of Causation.

Distributions

A distribution can be described by a probability function f(x) (a histogram) or by the cumulative function F(x) (the cumulative probability). For a discrete distribution, f(x) is the empirical distribution, subject to

  Σ_x f(x) = 1

and then P(X = x) = f(x). We can go to a continuous distribution:

  P(x < X < x + dx) = f(x) dx
  P(a < X < b) = ∫_a^b f(x) dx
  ∫_{−∞}^{+∞} f(x) dx = 1

The cumulative function is

  F(x) = ∫_{−∞}^x f(t) dt,     f(x) = dF(x)/dx

Parameters of a distribution

We have a set of numbers x_i, where i varies from 1 to N.
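As a numerical aside before developing the parameters, the additivity and conditional-probability examples above can be checked directly (all values are taken from the text):

```python
# Additivity: P(A u B) = P(A) + P(B) - P(A n B), Munich/Brussels question.
p_munich, p_brussels, p_either = 0.7, 0.4, 0.8
p_both = p_munich + p_brussels - p_either   # P(A n B)
p_neither = 1 - p_either

# Conditional probability: the US Air Shuttle example.
p_depart, p_both_on_time = 0.83, 0.78
p_arrive_given_depart = p_both_on_time / p_depart

# Probability of causation: benzene-induced leukemia vs all leukemia.
p_causation = 0.001 / 0.007

print(round(p_both, 2), round(p_neither, 2))   # 0.3 0.2
print(round(p_arrive_given_depart, 2))         # 0.94
print(round(p_causation, 2))                   # 0.14
```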
To each x_i we attach a weighting factor f_i (f_i > 0), with

  Σ_{i=1}^N f_i = N,     the total weight

  x̄ = Σ_{i=1}^N f_i x_i / N,     the arithmetic mean

The geometric mean is

  GM = ( Π_{i=1}^N x_i^{f_i} )^{1/N}

where Π is a symbol for the product of all quantities 1 to N. Taking logarithms,

  ln(GM) = Σ f_i ln(x_i) / N

[We note that ln(GM) is the arithmetic mean of the quantities ln x_i.]

  mode = the value of x_i with the greatest weight f_i
  median = for unweighted items, the value of x exceeded by half the x_i in the list

The variance describes the deviation from the expected value. We define the variance of a set of numbers x_i as the mean square deviation of individual measurements about the average of an infinite set of such measurements, of which the x_i are a small subset:

  σ² = Σ_{i=1}^N (x_i − x̄)² / N

(for the idealized infinite set, whose average is the expected value E). The root mean square (rms) or standard deviation is the square root of this.

An estimate of the standard deviation can be obtained from the data. This is similar to the rms deviation of a small set, but has N − 1 in the denominator, not N:

  s = √[ Σ_{i=1}^N (x_i − x̄)² / (N − 1) ]

This is because we cannot get a deviation from one measurement; we need at least two. The difference between N − 1 and N is in practice small, and most people (including me) are sloppy about the distinction, but it is important methodologically.

Expanding Σ (x_i − x̄)² and noting that Σ x_i = N x̄, we find

  s = √[ (Σ x_i² − N x̄²) / (N − 1) ]

If we consider a continuous distribution of numbers we can turn the sums into integrals; the total weight becomes ∫_{−∞}^{+∞} f(x) dx, normalized to 1. We will discuss mostly continuous distributions.
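The formulas above can be checked on a small unweighted data set (all weights f_i = 1); the numbers here are illustrative:

```python
# Check the mean, geometric mean, and the two equivalent forms of s.
import math
import statistics

x = [2.0, 3.0, 5.0, 7.0, 11.0]
N = len(x)
xbar = sum(x) / N

# Geometric mean: the exponential of the arithmetic mean of ln(x_i).
gm = math.exp(sum(math.log(v) for v in x) / N)

# Sample standard deviation with N-1 in the denominator...
s_direct = math.sqrt(sum((v - xbar) ** 2 for v in x) / (N - 1))
# ...and via the expanded identity s^2 = (sum x_i^2 - N xbar^2)/(N-1).
s_expanded = math.sqrt((sum(v * v for v in x) - N * xbar ** 2) / (N - 1))

print(xbar)                                      # 5.6
print(round(gm, 3))                              # 4.707
print(round(s_direct, 4), round(s_expanded, 4))  # the two forms agree
```

The two forms of s agree to rounding, and both match the library routine statistics.stdev, which also uses the N − 1 convention.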
The arithmetic mean, also written µ = x̄ = E(x), is then

  µ = E(x) = ∫_{−∞}^{+∞} x f(x) dx

The mean of a function g(x) of x is

  µ_g = E(g(x)) = ∫_{−∞}^{+∞} g(x) f(x) dx

Joint distribution

The probability of a number lying between x and x + dx AND between y and y + dy is f(x,y) dx dy, with

  ∫∫ f(x,y) dx dy = 1

The marginal distributions are

  g(x) = ∫ f(x,y) dy,     h(y) = ∫ f(x,y) dx

A conditional probability distribution is

  f(y|x) = f(x,y) / g(x)

If f(x|y) does not depend on y, then f(x|y) = g(x) and

  f(x,y) = g(x) h(y)

that is, x and y are independent.

The normal, or Gaussian, distribution

If a large number of independent measurements are values of the same quantity, the probability of each individual measurement lying between x and x + dx is P(x) dx; if we also assume that P(x) is symmetric about x = 0, the distribution of these measurements follows the "normal curve". This is sometimes called the Gaussian curve, in honor of the 19th-century German mathematician, physicist and geographer. This curve, and the picture of Gauss, can be found on the German 10-mark banknote.

  n(x; µ, σ) = (1 / (√(2π) σ)) e^{−(x−µ)² / (2σ²)}

The factor 1/(√(2π) σ) in the equation enables the distribution to be normalized to unity:

  ∫_{−∞}^{+∞} n(x) dx = 1

µ is called the expectation or mean of the distribution, and σ² is called the variance.

Standard normal distribution

Put

  z = (x − µ) / σ

Then

  n(z) = (1 / √(2π)) e^{−z²/2}

This is the standard normal distribution. This, and in particular its integral ∫_0^z n(t) dt, is tabulated in various texts.
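As a numerical illustration of the normalization, here is a sketch that integrates the density n(x) by a simple trapezoidal sum (the step size and integration range are my own choices):

```python
# The normal density, checked to integrate to (nearly) one.
import math

def n(x, mu=0.0, sigma=1.0):
    """Normal density n(x; mu, sigma)."""
    z = (x - mu) / sigma
    return math.exp(-z * z / 2.0) / (math.sqrt(2.0 * math.pi) * sigma)

# Trapezoidal rule over [-8, 8]; the tails beyond contribute ~1e-15.
steps = 16000
h = 16.0 / steps
xs = [-8.0 + i * h for i in range(steps + 1)]
total = h * (sum(n(x) for x in xs) - (n(xs[0]) + n(xs[-1])) / 2.0)

print(round(total, 8))   # very close to 1
print(round(n(0.0), 4))  # peak height 1/sqrt(2*pi), about 0.3989
```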
For the distribution of the sum (or difference) of two quantities, each normally distributed, we also have a normal distribution, with standard deviation given by

  σ² = σ₁² + σ₂²

One might, for example, measure the distance between two points: L = L₁ − L₂. The distribution in L will be normal,

  P(L) = (1 / (√(2π) σ)) exp( −(L − (L̄₁ − L̄₂))² / (2σ²) )

with

  σ² = σ₁² + σ₂² ≈ 2σ₁²     if σ₂ ≈ σ₁

where σ₁ is the standard deviation of measurements of L₁. [As an exercise, prove these last two statements.]

Attached is another set of notes on the behavior of log normal distributions. Log normal distributions arise when the logarithm of a quantity is normally distributed. They are very widely used in exposure and risk analysis.

Binomial distribution

If p is the probability of success in a trial, and q = 1 − p is the probability of failure, what is the distribution of successes in n trials?

  b(x; n, p) = C(n,x) p^x q^{n−x}

  Mean:     µ = np
  Variance: σ² = npq = qµ

Example: we have a bioassay with 100 rats, and p = 0.2 is the probability of getting cancer in a lifetime. Then µ = 20 and σ² = 0.8 × 20 = 16.

The numbers C(n,x) are the coefficients in a binomial expansion; hence the term Binomial Distribution.

In the limit as n → ∞, putting

  z = (x − np) / √(npq)

we get the standard normal distribution [unless p = 0 or q = 0]. WE WILL USE THIS APPROXIMATION (until we go to a computer program).

Poisson distribution

  p(x; λt) = e^{−λt} (λt)^x / x!

is the probability distribution of the number x of independent outcomes in a time t, where λ is the average number of outcomes in unit time.

Example: if the average rate of oil tankers entering NY Harbor is 10/day, yet we can only cope with 15 in any one day, how often are we in trouble?

  P(x > 15) = 1 − P(x ≤ 15) = 1 − Σ_{x=0}^{15} p(x; 10) = 1 − 0.9513 = 0.0487/day

p(x; 10) can be read from the tables in the books or calculated. (Calculate it once in your lives!)
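The tanker example can indeed be calculated in a few lines; a sketch (poisson_pmf is my own helper name):

```python
# Poisson check of the tanker example: mean rate 10/day, capacity 15/day.
import math

def poisson_pmf(x, mean):
    """Poisson probability of exactly x events, given the mean count."""
    return math.exp(-mean) * mean ** x / math.factorial(x)

p_cope = sum(poisson_pmf(x, 10) for x in range(16))  # P(x <= 15)
p_trouble = 1.0 - p_cope

print(round(p_cope, 4))     # about 0.9513
print(round(p_trouble, 4))  # about 0.0487 per day
```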
Exercise:
1. Prove that the mean of this distribution is λt.
2. Prove that the variance of this distribution is λt.

If n → ∞ and p → 0, keeping np = µ constant, then b(x; n, p) → p(x; µ).

Goodness of Fit

How often do we get a deviation greater than z standard deviations? z is called the Critical Ratio. For a normal distribution, the fractional area above z is

  ∫_z^∞ e^{−t²/2} dt / ∫_{−∞}^{+∞} e^{−t²/2} dt

This is so important that there are tables of "areas under the normal curve" in all the books. BUT BEWARE: are you asking for the area above z as a fraction of the whole distribution, ∫_{−∞}^{+∞}, or as a fraction of the half distribution, ∫_0^∞? It all depends on the problem. The distributions are called one sided (one tailed) vs. two sided (two tailed).

  z    One Sided    Two Sided
  1    0.84         0.68
  2    0.98         0.95
  3    0.9987       0.9974

(Here the one-sided column is the fraction of the area below z, and the two-sided column is the fraction between −z and +z.)

We use words to describe the tail: "upper 95 percentile" (z ≈ 2 two sided; z = 1.645 one sided).

Exercise: if you perform a large number of bioassays with 100 rats each, and call them statistically significant if P < 0.05, how many times do you expect to be wrong?

Statistical Independence

The importance of the concept of statistical independence of quantities cannot be overstated. There are two major ways in which mistakes are made. One I loosely call the Feynman trap; the other the Tippett trap.

Posing a problem for his undergraduate class, Richard Feynman, the Nobel physicist, noted a car in the parking lot with a particular license plate, ARW357. One can easily assess the probability of seeing this license plate by multiplying the independent probabilities of seeing each number (1/10) and each letter (1/26). The answer is one in eighteen million. Yet Feynman had just seen the license plate, so it had unity probability! Since Feynman asked the question when he already knew the answer, the statistical calculation was invalid. This point has been raised, less dramatically, by many others. See D.L. Goodstein, "Richard P. Feynman, Teacher", Physics Today 70-75 (Feb.
1989). We will discuss in the seminar various ways this appears in disguised forms in practice.

Tippett, a famous English statistician, pointed out that if one sets a level of significance of p < 0.05 and then looks at twenty separate studies, on average one of them will appear significant at that level by chance alone.
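To close, the normal-curve areas tabulated under Goodness of Fit and Tippett's point can both be checked numerically; a sketch using math.erf (phi is my own helper name):

```python
import math

# One-sided and two-sided areas under the standard normal curve.
def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

one_sided = {z: phi(z) for z in (1, 2, 3)}              # area below z
two_sided = {z: 2.0 * phi(z) - 1.0 for z in (1, 2, 3)}  # area within +-z

# The Tippett trap: twenty independent null studies judged at p < 0.05.
p_at_least_one = 1.0 - 0.95 ** 20

print({z: round(a, 4) for z, a in one_sided.items()})
print({z: round(a, 4) for z, a in two_sided.items()})
print(round(p_at_least_one, 2))   # about 0.64
```

Even when every study is null, the chance that at least one of twenty clears the 0.05 bar is about 64%; likewise, in a long series of bioassays judged at P < 0.05, about 5% will look "significant" by chance, which is the point of the exercise in the Goodness of Fit section.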