A Short Introduction to Probability for Statistics
Math 146

Part I. Modeling Observations

1 Statistical observations

Statistics is concerned with the analysis of quantitative observations, usually repeated several times. For example, we might consider

• Asking individuals about their opinions or preferences
• Measuring some physical or chemical quantity, e.g., the concentration of a chemical in a water pool
• Analyzing a flow of bits to determine whether a signal is present, or whether it is just noise
• Observing some significant marker to determine whether a medical treatment was effective

We could go on for a long time. In such cases, we are not so much concerned with the single result of our observation, but rather with what implications, if any, our observation(s) have for a more general conclusion: was our observation just a random outcome, with no general implication, or was it evidence of a more general fact? Such a question cannot be answered simply by looking at the data, since we are wondering about what lies behind the data. A reasonable answer will necessarily require that we add our own insight to the data, which, to be as effective as possible, means having a (preferably mathematical) model for the whole observation procedure. This approach, developed at the turn of the 20th century, has proved to be really effective, as opposed to the 18th-19th century focus on searching for representative observations: observing typical situations that could be extrapolated to more general cases. The problem with the latter approach (which survives, in a way, at least at the margins) is that it is practically impossible to define what is "typical" in an objective way. The fundamental mathematical toolbox that has turned out to work is Probability Theory. To get deep into this theory we would need a lot more mathematical machinery than is available to us here, but we can get an intuitive understanding if we take a few facts for granted.

2 A model for statistical observations

We will consider observations that result in numerical data. The simplest experiment we can consider is the flip of a coin. Let's say we are asking whether it will turn up heads, and we record 1 if it does, 0 if it turns up tails. While there will be only one outcome, as far as we can tell beforehand it could be a 0 or a 1. We call such an uncertain outcome a Random Variable: a variable because it can take more than one value, and random because we cannot tell with certainty what this value will be. As usual in math, we assign a symbol to this random variable; let's say, for example, we call it X. A probabilistic model consists in assigning to each possible value of X (there are two in this case, but, in general, there will be many more) a number between 0 and 1 which, loosely speaking, quantifies the likelihood of that value occurring: 0 meaning that it will not occur, 1 that it will certainly occur, and an intermediate value corresponding to uncertainty about its occurrence (the closer to 1, the higher the likelihood of it occurring). We call these numbers probabilities, and write, in our example,

P[X = 1] = p

where p is the probability that we assign to the coin turning up heads.
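To make this concrete, here is a minimal Python sketch of the coin-flip random variable: before the call, all we can say is that the result will be 1 with probability p (the value p = 0.5 below is an arbitrary choice for illustration).

```python
import random

def flip(p):
    """One flip of a coin with P[X = 1] = p: returns 1 (heads) or 0 (tails)."""
    return 1 if random.random() < p else 0

print([flip(0.5) for _ in range(10)])  # e.g. [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
```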
2.1 Consistency of probability assignments

We assigned a probability to the occurrence (the technical name in probability is event) X = 1, so what about X = 0? This can be determined from a consistency requirement, technically an axiom, that probability assignments have to satisfy.

Additivity: Outcomes that cannot occur simultaneously have a combined probability that is the sum of the individual probabilities.

In our case, the coin will turn up either heads or tails, not both, so the probability that either one or the other occurs will be calculated as

P[X = 0 or X = 1] = P[X = 0] + P[X = 1]    (1)

In this simple case, this determines P[X = 0], because we already stipulated (that's another axiom, even if we did not flag it as such) that something that will certainly occur is assigned probability 1, and it is certain that either heads or tails will occur, hence P[X = 0 or X = 1] = 1. From (1) it then follows that

P[X = 0] = P[X = 0 or X = 1] − P[X = 1] = 1 − P[X = 1] = 1 − p
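A small sketch of these consistency requirements, with the coin model stored as a plain dictionary (the value of p is again an arbitrary choice):

```python
p = 0.5  # arbitrary illustration value
dist = {1: p, 0: 1 - p}  # the coin model as a table of probabilities

# Axioms as checks: probabilities lie between 0 and 1, and something
# certainly occurs, so the total probability is 1.
assert all(0 <= prob <= 1 for prob in dist.values())
assert abs(sum(dist.values()) - 1.0) < 1e-12

# Additivity: X = 0 and X = 1 cannot occur together, so
# P[X = 0 or X = 1] = P[X = 0] + P[X = 1] = 1, as in equation (1).
print(dist[0] + dist[1])  # 1.0
```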
2.2 How do we assign probabilities?

This question is surprisingly delicate, and has led to very heated arguments. Simplifying things, we can list three basic approaches; note that they are not mutually exclusive, meaning that, depending on the specific situation, a researcher might resort to one of these approaches, only to use another one in a different case. (A simulation below illustrates the first two approaches.)

1. The Classical Model. This relies on the ability to set up a model where a random variable X can take a finite number of values, and, since we see no reason to believe that one value is more or less likely than any other (an assumption that goes under the moniker of Principle of Sufficient Reason), each is assigned the same probability. In the coin tossing case, if we have no reason to think that the coin is biased (that is, that it will more likely turn up on one side rather than the other), we would assume p = 1 − p, so p = 1/2. A slightly richer case would be a roulette wheel, where, in most American roulettes, we have 38 slots (labeled 1 to 36 and, additionally, a 0 and a 00 slot). Here X can take the values 0, 00, 1, 2, ..., 36, and we would assign probability 1/38 to each. (Real-life roulettes, just like real-life coins, are unlikely to conform so neatly to the ideal. You have possibly enjoyed one of the several movies where clever gamblers were able to spot the bias in a roulette, and/or how the croupier manages it, and gain a lot of money; in real casinos, anybody suspected of keeping track of long-term roulette outcomes is very likely to be thrown out right away, which has to do with the second meaning of probability, discussed below.) A yet more elaborate example is the throw of two dice. It turns out that the best model considers as equally likely all pairs (i, j), where both i and j take all possible values between 1 and 6. There are 36 such pairs, so any outcome (i, j) is assigned probability 1/36. (Of course, an argument similar to the one about real-life roulettes applies here too.)

2. The Frequentist Model. This is the usual model applied when addressing situations that are (or can be) repeated many times over. It is loosely based on a basic theorem in probability, but that's not its foundation, as that would lead to a circular argument. Basically, it states that if an outcome has probability p, then, if we repeat the observation over and over again, making sure that each observation is not influenced by any of the previous ones (for example, when flipping a coin we are not cheating so that it will turn up as in the previous toss, or vice versa, both of which cheats are pretty easy to do), the frequency of occurrence of this outcome will approach its probability. This goes by the name of the Empirical Law of Large Numbers (we will mention the mathematical Law of Large Numbers, which might look similar, but is logically unrelated). That this empirical law should hold could be taken as a definition of a probabilistic experiment, in the sense that if repeating an experiment over and over seems to indicate that the frequency of occurrences does settle to a specific value, we can assume that probability is a good model for this experiment. Thus, if we are tossing coins and P[X = 1] = 1/2, over many tosses roughly half the outcomes should turn out heads. (This does not mean at all that, if we toss a fair coin enough times, say 1,000,000, the number of heads will approach half the number of tosses, say 500,000. If we toss a coin N times, and n outcomes are heads, we would expect n/N to approach 1/2, but that's very different from n approaching N/2: for example, if n ≈ N/2 + √N, we will still have n/N ≈ 1/2, while n is drifting away from N/2.) As you will realize, this approach is the foundation of what goes under the name of Classical Statistics, where the probabilities in a model are assigned based on repeated observations of the phenomenon in question.

3. The Subjective Model. This is particularly popular when addressing one-time events (as opposed to repeated experiments), where the frequentist approach is simply impossible. In this model, probabilities are subjective assessments of how likely a given event is to occur. The classical way of doing this is to ask the subject to provide odds that s/he would be willing to accept on a bet over the outcome. This approach has evolved into a more systematic method, known as Bayesian Statistics, that has gained considerable ground lately. Relying on a deceptively simple theorem, known as Bayes' Theorem, it goes like this: given the problem of assessing a probability (for example, is this coin fair, that is, is p = 1/2?), we start by considering p as a random variable (after all, we do not know its value for sure), and assign to it an a priori probability distribution (a popular choice in this case would be a uniform distribution over [0, 1], as defined below). We then perform repeated tosses, and use the outcomes to change, if warranted, our initial guess, using Bayes' Theorem to do so in a precise way. This method requires, in practical situations, very large computing power, and the current availability of such power has made it very popular. In fact, rigorous approaches to Big Data are mostly based on Bayesian methods. We will not address Bayesian techniques in this class, as they require significant mathematical developments, not to mention the short time we have on our hands.

Classical statistics, which is what we will look at, is based on model number 2. Given our time limitation, we will consider only the simplest applications, which, however, should help you get a feeling for its methods. This is still the most common approach in the medical and social sciences, even though it has been questioned for a number of reasons. (The criticisms levied at experiments based on classical statistics are not trivial. It is true that a number of faults refer to researchers not following proper statistical methods, thus drawing conclusions that are unwarranted even within the classical framework. What is more troubling is that, in a distressingly large number of cases, properly obtained results could not be replicated by independent researchers, or even by the original researchers. Of course, the gold standard of scientific research is reproducibility of experimental outcomes, so this is a very serious issue. The arguments about this are ongoing and definitely beyond our scope; the best we can do, for now, is be aware that statistical statements need to be taken with a large grain of salt.) It may easily happen that you end up using its tools in your future job, and, with an understanding of the philosophy that underpins it, you should have no trouble adopting methods that we will have had no time to look at, but which are based on the same principles.
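As promised above, a short simulation contrasting the first two approaches on the two-dice example: the classical model assigns 1/36 to each ordered pair, and a frequentist run of many throws should see the relative frequency of "sum equals 7" settle near the classical value 6/36 (the number of simulated throws is an arbitrary choice).

```python
import random

# Classical model: all 36 ordered pairs (i, j) are taken as equally likely.
pairs = [(i, j) for i in range(1, 7) for j in range(1, 7)]
p_seven = sum(1 for (i, j) in pairs if i + j == 7) / len(pairs)
print(p_seven)  # 6/36 = 0.1666...

# Frequentist reading: over many independent throws, the relative frequency
# of "sum equals 7" should settle near the classical probability.
N = 100_000  # arbitrary number of simulated throws
hits = sum(1 for _ in range(N)
           if random.randint(1, 6) + random.randint(1, 6) == 7)
print(hits / N)  # close to 0.1667 for large N
```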
3 Probability Models

3.1 Generalities

3.2 Distributions

Let's consider a random variable X that may take n possible values x_1, x_2, ..., x_n, with probabilities p_k = P[X = x_k]. By consistency, we need p_k ≥ 0 and p_1 + p_2 + ··· + p_n = 1. For brevity, the sum on the left-hand side is usually written using sigma notation: \sum_{k=1}^n p_k. The collection of numbers p_1, p_2, ..., p_n is called the probability distribution of the random variable X.

It is rare that we will look at a single random variable; it is common to have to look at more than one quantity, and at the very least, even if we are looking at a single quantity (like, say, the concentration of a chemical in a body of water), we are well advised to take more than one measurement, resulting in observing several random variables (presumably all with the same distribution). In these cases (for simplicity, let's look at two random variables) we will be looking at

P[X_1 = x_i, X_2 = x_j] = p_{ij}

(the comma stands, as is customary in probability, for "and": we are looking at the probability that X_1 takes the value x_i and X_2 takes the value x_j). Considering all possible values, the collection of the p_{ij} is often called the joint distribution of X_1 and X_2. Determining the joint distribution of several random variables is, in general, cumbersome, and requires some specific information on how each may affect the others, but there is one case when things are simple: when the variables are independent.

Definition. Random variables X_1, X_2, ..., X_r are independent if all joint probabilities factor: for all possible choices of x_1, x_2, ..., x_r,

P[X_1 = x_1, X_2 = x_2, ..., X_r = x_r] = P[X_1 = x_1] · P[X_2 = x_2] · ... · P[X_r = x_r]

(A motivation for this definition relies on the notion of conditional probabilities. This is an important concept, but we can skip it for the strict purpose of collecting the facts we will need for our statistical tools. Do look at the corresponding section in the book, though.)

In this case, determining the joint distribution is equivalent to determining the distribution of each random variable separately. In particular, if all the random variables have the same distribution, the collection is said to be a set of independent, identically distributed random variables (i.i.d.), and is called a simple random sample of the common distribution.
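A quick simulated check of this definition, under the assumption of two independent Bernoulli variables with an arbitrarily chosen p: the empirical joint frequencies should approximately factor into the product of the marginals. (With real data, of course, independence is an assumption to be justified, not a fact produced by the simulation.)

```python
import random

p = 0.3        # arbitrary illustration value
N = 200_000    # number of simulated repetitions
# Two independent Bernoulli(p) variables, observed N times.
flips = [(int(random.random() < p), int(random.random() < p)) for _ in range(N)]

marginal = {1: p, 0: 1 - p}
for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    freq = sum(1 for f in flips if f == pair) / N
    print(pair, round(freq, 3), "vs", marginal[pair[0]] * marginal[pair[1]])
```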
3.3 Specific models

3.3.1 Discrete distributions

("Discrete" means that the values our variables can take can be listed; in the examples here, the values are positive integers. In principle, since we can only observe quantities up to some finite precision, and within some finite range, all random variables are discrete. However, as we discuss in the next subsection, it is unwieldy to work with a really huge number of values that are very close to one another, like, say, all fractions of the form n/10^10, for all integers n.)

In most of our applications, we will assume that the distribution we are working with belongs to a specific class. Here are some of the simplest (sketched in code below):

• Bernoulli Distribution: a random variable with a Bernoulli distribution can take only two values, for example 0 and 1. Thus, we will have P[X = 1] = p, P[X = 0] = 1 − p. The value of p depends on the specific case we are dealing with. A statistical question could be to determine p from observations of several random variables, all with the same Bernoulli distribution (that is, from a simple random sample of a Bernoulli distribution).

• Binomial Distribution: It turns out that if X_1, X_2, ..., X_n is a simple random sample of a Bernoulli distribution with parameter p, the sum Y = \sum_{k=1}^n X_k of these independent variables, which will take values 0, 1, 2, ..., n, has the distribution given by

P[Y = r] = \frac{n!}{r!(n−r)!} p^r (1−p)^{n−r}

(n!, read "n factorial", is defined as the product of all integers between 1 and n: n! = 1 · 2 · 3 · ... · n).

• Geometric Distribution: Consider a (potentially unbounded) sequence of independent Bernoulli variables X_1, X_2, X_3, .... We look at the first variable that takes the value 1. It could be the first, the second, and so on; in principle, we could keep going and never get a 1, of course. Call G the index of this variable. Its possible values are 1, 2, 3, ..., without bound. By independence, it is easy to see that

P[G = k] = P[X_1 = 0] · P[X_2 = 0] · ... · P[X_{k−1} = 0] · P[X_k = 1] = (1−p)^{k−1} p

(Note that, as k becomes large, since 1 − p < 1, the probability that G takes this large value becomes smaller and smaller, approaching 0 as k becomes really large. This is the standard model for playing the lottery or any other game of pure chance, where each round is independent of the others, and your probability of winning each round is p. Thus, even if p is very small, the probability of never winning eventually becomes very small. Unfortunately, it may take several lifetimes to make this probability really small. Also, as can be seen by independence (and more precisely, using conditional probabilities), things do not improve if you play, say, N times, and never win: the next rounds are independent of the ones you played, so your chances of winning, given that you lost N times, are the same as your chances at the start.)

• Uniform distribution over n values a_1, a_2, ..., a_n: that's the distribution that assigns probability 1/n to each of the values. It is the distribution that the classical method assigns to the possible values.

Of course, there are many other discrete distribution models (an important one is the Poisson model, where X can take any nonnegative integer value and P[X = k] = \frac{λ^k}{k!} e^{−λ}, where λ > 0 is a parameter), but we will actually not work with any of these discrete distributions directly.
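A minimal sketch of the binomial and geometric formulas above (the parameters n and p are arbitrary illustration values); note that the binomial probabilities sum to exactly 1, while the geometric ones only approach 1 as more terms are included.

```python
import math

def binomial_pmf(r, n, p):
    """P[Y = r] for Y the sum of n independent Bernoulli(p) variables."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

def geometric_pmf(k, p):
    """P[G = k]: the first 1 appears on trial k."""
    return (1 - p)**(k - 1) * p

n, p = 10, 0.3  # arbitrary illustration parameters
print(binomial_pmf(3, n, p))                              # about 0.2668
print(sum(binomial_pmf(r, n, p) for r in range(n + 1)))   # 1 (up to rounding)
print(sum(geometric_pmf(k, p) for k in range(1, 100)))    # approaches 1
```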
3.3.2 Continuous Distributions

Suppose you are timing the arrival of the first customer at a service station. If that's a bank teller, you might record the time with a precision of a minute. If it is a computer network service, like a printer or a web page request, the timing could be precise up to a hundredth of a second. In any case, the possible values of your observations form a very large set, they are very close to each other (the difference between 2 and 3 minutes is just one minute, and it's much worse if you are at the hundredths of seconds), and the probability of an event occurring exactly at a given value is extremely small (eventually, an event will occur, and that will lead to a value for your observation, but, a priori, it is extremely unlikely that the first web page request will occur at any fixed time point).

When faced with these situations, which are extremely common, it makes little sense to consider a list of possible outcomes, and it is much more practical to consider the outcome set to consist of all real numbers, with probabilities assigned (in a consistent way) not to individual outcomes, but to intervals. So we will not ask, for example, what is the probability that our first web page request will occur after exactly 230.302 seconds (a highly unlikely event), but rather, say, what is the probability that our first web page request will occur in the interval between 200 and 300 seconds. Thus, rather than looking at statements of the form P[X = x], we will look at statements of the form P[a ≤ X ≤ b] (with < instead of ≤, depending on preferences). In this case, we talk of continuous random variables. With some advanced math tools, one can build a complete theory, starting with approximations from discrete random variables, and this construction, completed in the 1930s by the great Russian mathematician A. N. Kolmogorov, provided the rigorous foundation for probability as a completely legitimate part of mathematics.

We will consider only a less general class of continuous variables, the so-called absolutely continuous random variables. They are characterized by the existence of a continuous function (called a probability density) f(x) ≥ 0, such that, for any a and b, P[a ≤ X ≤ b] is given by the area of the portion of the coordinate plane between the graph of f and the horizontal axis, between a and b. From calculus, this area is denoted as \int_a^b f(x)\,dx. Note that the area below a single point is 0, corresponding to the fact that we cannot really consider the probability of a specific single real number occurring as outcome; thus

P[a < X < b] = P[a ≤ X ≤ b] = P[a < X ≤ b] = P[a ≤ X < b]

(note the strict inequalities; this holds for absolutely continuous random variables). This can extend to the whole real line (as in a = −∞, b = ∞), and since an outcome has to occur, this means that

P[−∞ < X < ∞] = \int_{−∞}^{∞} f(x)\,dx = 1

(here ∞ is not a number, but a symbol indicating "without bounds", so no variable can be equal to infinity). It is common usage to consider, instead of the density, the so-called cumulative distribution function (cdf), defined as

F(x) = P[X ≤ x] = \int_{−∞}^{x} f(t)\,dt

By consistency,

P[a < X ≤ b] = F(b) − F(a)

so that knowledge of the cdf allows us to compute the probability corresponding to any interval.

A common model for the first-arrival problem we started with is the exponential distribution, where the density is given by

f(x) = 0 for x < 0,    f(x) = λ e^{−λx} for x ≥ 0

where λ > 0 is a parameter. It turns out that, for exponential distributions, the cdf is given by

F(x) = 1 − e^{−λx}    (for x ≥ 0)

A simple and useful continuous distribution is the uniform distribution over an interval [a, b]. This assigns to any subinterval of [a, b] a probability proportional to the length of the subinterval. Thus, if a ≤ c ≤ d ≤ b, P[c ≤ X ≤ d] = (d − c)/(b − a). In the useful special case of a uniform distribution over [0, 1], if 0 ≤ c ≤ d ≤ 1, P[c ≤ X ≤ d] = d − c.

A very common model, motivated by a basic theorem that we will discuss momentarily, is the Normal or Gaussian distribution, defined by two parameters, commonly denoted by µ and σ, whose density is given by

f(x) = \frac{1}{\sqrt{2π}\,σ} e^{−\frac{(x−µ)^2}{2σ^2}}

You will often see the notation X ∼ N(µ, σ) for such a random variable. A special case is µ = 0, σ = 1, the so-called standard normal distribution. It turns out that if X ∼ N(µ, σ), then the new random variable Z = (X − µ)/σ is N(0, 1).
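A small numerical sketch of interval probabilities, using the exponential cdf F(b) − F(a) for the web-request example (the rate λ is a hypothetical value, chosen so that the mean waiting time is 250 seconds) and the length rule for the uniform distribution:

```python
import math

def exp_cdf(x, lam):
    """Exponential cdf: F(x) = 1 - e^(-lam * x) for x >= 0, and 0 for x < 0."""
    return (1 - math.exp(-lam * x)) if x >= 0 else 0.0

lam = 1 / 250  # hypothetical rate: mean waiting time of 250 seconds
# P[200 <= X <= 300] = F(300) - F(200)
print(exp_cdf(300, lam) - exp_cdf(200, lam))  # about 0.148

# Uniform over [a, b]: interval probabilities are proportional to length.
a, b = 0.0, 1.0
c, d = 0.2, 0.5
print((d - c) / (b - a))  # 0.3
```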
The cdf of a normal random variable cannot be written in terms of familiar functions, but the cdf of N(0, 1) variables, often denoted by Φ, has been calculated to any desired precision, and is available as a table in all statistics and probability books. (Spreadsheets and statistical software will be happy to give you directly the cdf of any normal variable, not only the standard one. When these are not available, we can use a table for N(0, 1), using the fact that if Z is standard, then X = µ + σZ is N(µ, σ).)

3.4 Indexes

It turns out that some quantities attached to a distribution can be convenient to evaluate and use. In fact, when considering a specific class of models (binomial, Normal, and so on), these quantities are usually sufficient to specify the model completely. The main ones are:

• Expectation (also expected value, and sometimes mean): in the discrete case it's defined as

EX = \sum_{k=1}^n x_k P[X = x_k]

With a little more advanced math, this definition extends to the case when there are infinitely many values and when the variable is continuous. If X has a Bernoulli distribution, it is easy to see that EX = p. If it has a binomial distribution, then EX = np. For the geometric distribution, EX = 1/p. For the exponential distribution, EX = 1/λ, and for the normal distribution EX = µ. As you can see, in all but the last example EX fully specifies the model. We note the remarkable fact (also easy to prove) that the expectation of the sum of n random variables is equal to the sum of the expectations.

• Variance: that's the expected value of the square of the difference between the random variable and its expectation:

Var[X] = E[(X − EX)^2]

Using the square makes the math easier (it may not be obvious, but that's the case), and it makes sure we count positive and negative differences equally. Exponential distributions have variance 1/λ², and normal distributions have variance σ². The variance of the sum of n random variables is not given by the sum of the individual variances at all, except in the case of independent variables. (Actually, the condition for the variance of a sum to be the sum of the variances is far less restrictive, but in our applications the deciding factor will always be independence.)

• Moments: more generally, we can define

M_r = E[(X − EX)^r]

as the r-th moment. Conventionally, the expectation is considered the first moment. The variance is, obviously, the 2nd moment. We will not use these indexes, and variations on them, in our examples. Note that when infinitely many values are possible, it may happen that higher moments, or even the variance, or even the expectation, are not defined. It is not hard to show that if the r-th moment exists, then all lower moments are also defined.
• Percentiles or Quantiles: These are numbers q_p such that P[X ≤ q_p] = p. In particular, q_{1/2} is called the median, q_{1/4} the first quartile, and q_{3/4} the third quartile (the median is the second quartile). You may also read about quintiles (quantiles such that p = k/5, for k = 1, 2, 3, 4), or percentiles, referring to p = k/100, for k = 1, 2, ..., 99. These are all exactly defined for continuous random variables, but not for discrete ones, where there are obvious ambiguities. There are conventions set up to define a unique quantile for discrete random variables, so that your spreadsheet will come up with a specific number, but they are just that: conventions. For example, if your random variable assigns probability 1/4 to each of the values 1, 2, 3, 4, any number between 2 and 3 qualifies as a median, even if the most common convention is to call 2.5 the median. If a distribution has a density with a graph symmetric around a value, that value will be equal to both the expectation and the median; that's the case for a normal distribution, where both are equal to µ. Lacking this symmetry, the two are different: the expectation of an exponential variable is 1/λ, while the median is the solution to the equation 1 − e^{−λx} = 1/2, that is,

e^{−λx} = 1/2
λx = ln 2
x = \frac{\ln 2}{λ} ≈ \frac{0.693}{λ} < \frac{1}{λ}

Thus, an exponential random variable is more likely to take values smaller than the expectation, rather than larger.

• Mode: This is an index we will have no use for, and it is only meaningful for (some) discrete variables. This is the value of X that has the highest probability. (For absolutely continuous random variables, the mode is often defined as the value(s) at which the density has a maximum.)

• Mean Absolute Deviation: We will not have any use for this either. It is the measure of dispersion best associated with the median: if X has median m, the MAD is E[|X − m|]. There are statistical applications for this measure and its associated index, the median, but they are technically less easy than the more common pair, expectation and variance. You will see that some books define the MAD using the expectation instead of the median in the formula above, but this is a poor choice, with no rigorous reason.

We will concern ourselves only with parametric statistics. This means that we will observe random events that we assume can be described by a specific distribution form (e.g., normal, exponential, and so on), so that our problem consists in identifying the parameter(s) that characterize the distribution and which, in all practical cases, are the expectation and/or variance and/or other moments, or immediately connected to them (e.g., the expectation of an exponential random variable is the reciprocal of the parameter we denoted by λ). Also, we will only look at (absolutely) continuous distributions. One can apply this methodology to discrete distributions, but this requires more work, and is not as common in practical applications.

Part II. Estimating Distributions

4 Simple Random Sampling

Suppose we want to estimate the concentration of a given chemical in a body of water. We don't know what the concentration is, so we consider the result of the measurement random, a random variable, say C. If we take more than one measurement, we can be confident that we will get different numbers, because of a variety of factors (the concentration will not be exactly uniform over the body of water, our measuring instrument will have its own irregularities, and other factors that we may not even be aware of). We model this as a probability distribution for our random variable C. We will have to assume that this distribution belongs to a specific family, since we are only doing parametric statistics, and we will want to identify, as an example, the mean of this distribution. The main tool for this project is to perform several observations, all under the same conditions, and in a way that outcomes do not affect each other, which is modeled as observing a number of independent identically distributed random variables, what is known as a simple random sample: C_1, C_2, C_3, ..., C_n. The outcome of this experiment is a set of numbers c_1, c_2, c_3, ..., c_n. Of course, if we came back the next day and took n more samples, the outputs would most likely be different from these.
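Before stating the theorems, a simulated simple random sample may help fix ideas. The sketch below draws a large exponential sample (λ = 2 is an arbitrary choice) and compares the sample mean and sample median with the theoretical values 1/λ and ln 2/λ from Section 3.4:

```python
import math
import random
import statistics

lam = 2.0      # arbitrary illustration parameter
n = 100_000    # a large sample, so the estimates are stable
sample = [random.expovariate(lam) for _ in range(n)]  # simulated simple random sample

print(statistics.mean(sample))    # near the expectation 1/lam = 0.5
print(statistics.median(sample))  # near ln(2)/lam = 0.3466, smaller than the mean
print(math.log(2) / lam)          # the theoretical median
```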
Two basic theorems motivate the calculation of a first summary for these observations.

4.1 The Law of Large Numbers

This theorem (abbreviated LLN), possibly the first hard theorem in probability, says that, given a simple random sample X_1, X_2, ..., X_n from a distribution with expectation µ, as n becomes larger and larger, for any positive number δ, the probability

P\left[\left|\frac{\sum_{k=1}^n X_k}{n} − µ\right| < δ\right] > 1 − ε

(where ε is any given positive number, presumably very small). In words, the probability of the sample mean being very close to the expectation becomes closer and closer to 1. If δ and ε are very small, it is almost certain that the sample mean will be practically equal to the expectation; however, this may require a very large sample.

Thus, the quantity \bar{X} = \frac{\sum_{k=1}^n X_k}{n}, called the sample mean, will be very close to the common expectation with extremely high probability. In many rough-and-tumble applications, we may thus make a number of observations and take the resulting sample mean as a reasonable estimate of the expectation. This is often expressed by saying that the sample mean is an estimator of the expectation.

The sample mean could have a very different distribution than that of each observation. However, we may note the following properties: if the (independent) observations all have expectation µ and variance σ²,

E\bar{X} = µ,    Var\bar{X} = \frac{σ^2}{n}    (2)

4.2 The Central Limit Theorem

The LLN tells us that the sample mean will very likely be close to the expectation, but we would like to know how likely this is. In general, this is a difficult estimate, but, for reasonable distributions, there is a shortcut that we will use extensively. This theorem (abbreviated CLT) says, very roughly, that for large enough simple random samples, if the original distribution is nice enough (technically, this means that at least four moments are defined), the sample mean will be approximately distributed as a normal random variable. Technically, this is expressed by saying that

P\left[\sqrt{n}\,\frac{\bar{X} − µ}{σ} ≤ t\right]

becomes closer and closer to Φ(t), the cdf of the standard normal distribution, as n grows. We may remark that if the individual observations in a random sample are normally distributed, N(µ, σ), the sample mean \bar{X} is exactly normally distributed, as N(µ, σ/√n). Once again, for this approximation to be effective, the sample size n has to be large enough. How large? That depends on the specific distribution we are working with. As long as the distribution we use as a model is reasonably symmetric, and reasonably concentrated, the approximation kicks in fairly soon. (A classic case is a uniform distribution over [0, 1], where a sample of size 12 can be assumed to be large enough, so much so that for many years IBM computers used this fact to simulate a normal distribution, by adding 12 uniform random variables; see the sketch below.)

This theorem can be read in a few different ways. One that has many applications in modeling reads the statement as saying that

\sum_{k=1}^n \frac{X_k − µ}{\sqrt{n}}

is approximately normal with mean 0 and variance σ². If n is large, each term is a small random variable with mean 0. Thus, the sum of many small independent variables with mean 0 is approximately normal. This justifies, for example, Maxwell's law for gases, stating that the velocity of a gas particle is normally distributed: this velocity is the result of very many very small collisions each particle undergoes with the other particles, and the cumulative effect results in a normal distribution.
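The sketch below illustrates both the CLT and the IBM trick just mentioned: a sum of 12 independent Uniform[0, 1] variables has mean 6 and variance 12 · (1/12) = 1, so subtracting 6 gives an approximately standard normal variable. (Φ itself has no elementary formula, but it can be computed through the error function, which Python's math library provides.)

```python
import math
import random

def phi(t):
    """Standard normal cdf: Phi(t) = (1 + erf(t / sqrt(2))) / 2."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

# IBM trick: a sum of 12 Uniform[0,1] draws has mean 6 and variance 1,
# so (sum - 6) is approximately N(0, 1) by the CLT.
N = 100_000  # arbitrary number of simulated values
z = [sum(random.random() for _ in range(12)) - 6 for _ in range(N)]

t = 1.0
print(sum(1 for v in z if v <= t) / N)  # empirical P[Z <= 1]
print(phi(t))                           # Phi(1) = 0.8413...
```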
A similar argument justifies the standard theory of measurement errors: measuring a physical quantity, say the mass of an atom, is a high-precision operation, but outcomes are affected by very small uncontrollable external factors (cosmic rays, minimal earth temblors, ...), so we may assume that we are observing a normal random variable, centered on the true value of the mass. Albert Einstein's Nobel prize was not awarded in recognition of his Theory of Relativity, but rather for his previous work, including his theory of Brownian Motion, the erratic motion of particles suspended in a liquid, observed by the botanist Brown a few decades earlier. Einstein successfully modeled the motion of the particle by assigning a normal distribution to its position at any instant of time, as the result of many small impacts of the molecules in the liquid with the particle. (This result was expanded and made more rigorous, and also richer, by Norbert Wiener, the proper father of the mathematical Brownian Motion, another decade later, leading to a vast new field, the theory of Random Processes.)

We will rely on this theorem extensively, but we need to remember that it applies to collections of independent identically distributed observations. Also, the speed at which Φ is approached depends on the original distribution: the less symmetric it is, the slower the speed. Forgetting any of these conditions can lead to erroneous conclusions. (By now, a standard example of erroneous conclusions is provided by the housing crash of 2007. The complicated financial constructions built around housing mortgages were predicated on the assumption that default risks were essentially normally distributed. This may be reasonable in a normal environment, where one household's default need not have any effect on others. In the sub-prime lending frenzy, however, defaults snowballed, creating an unexpectedly large risk, as banks and eventually governments realized the hard way. Careful observers, including ones in the lending community, were aware of this threat, but calls to caution were largely ignored until it was too late. An otherwise very careful, somewhat less elementary, textbook argues that, without any real information on the distribution underlying a statistical experiment, we should automatically assume it is Gaussian. While this corresponds pretty much to common practice, it is not necessarily a wise choice under all circumstances, as the market crash of 2007 illustrated starkly.)

4.3 Empirical distribution and estimators for expectation and variance

If we observe a simple random sample, resulting in n numbers x_1, x_2, ..., x_n, from a distribution with (theoretical) expectation µ and (theoretical) variance σ², we may want to use our sample to get a grip on these values. The LLN tells us that

\bar{x} = \frac{1}{n}\sum_{k=1}^n x_k

is an estimator for the expectation µ. In fact, it is useful to consider our sample as a random distribution (called the empirical distribution), where the possible values are the n observations, and each is given probability 1/n. Of course, if we repeated this experiment, the numbers obtained would be different, and hence the distribution would be different; hence the "random" qualification. As you will notice, an empirical distribution is a (discrete) uniform distribution over the observed values. Thus, the sample mean is the expectation of the empirical distribution.
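A tiny sketch of the empirical distribution, with made-up measurement values: each observed value receives probability (number of occurrences)/n, and the expectation of this distribution is exactly the sample mean.

```python
from collections import Counter

observations = [2.1, 2.4, 2.1, 2.7, 2.4, 2.4]  # made-up measurements
n = len(observations)

# Empirical distribution: each observed value gets probability (count)/n.
empirical = {x: count / n for x, count in Counter(observations).items()}

# Its expectation, the sum of x * P[X = x], is exactly the sample mean.
print(sum(x * pr for x, pr in empirical.items()))  # 2.35
print(sum(observations) / n)                       # 2.35
```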
Applying our definition of variance, we find that the analog for the empirical distribution is given by

S^2 = \frac{1}{n}\sum_{k=1}^n (x_k − \bar{x})^2    (3)

(the notation S² is not particularly common; the book uses σ̂², but that makes it easy to confuse with the variance σ² of a non-random distribution). Note that \bar{x} and S² are values taken by random variables, since another round of observations will produce different values. It is useful to note (and not hard to see, with only a little algebra) that

E\bar{X} = µ,    E S^2 = \frac{n−1}{n} σ^2

Technically, the terminology is that \bar{X} is an unbiased estimator for the expectation, while S² is a biased estimator for the variance. This is of minimal significance in practice (in statistics, we don't know the value of the expectation or the variance anyway). More significant is the fact that, thanks to the Law of Large Numbers, both are consistent, that is, they are more and more likely to be closer and closer to the true values as n grows (note that (n−1)/n ≈ 1 as soon as n gets large).

Actually, looking at the cdf of the empirical distribution, it can be shown that, as n grows, it will approach the cdf of the original distribution. This is the basis for many nonparametric statistical procedures, which, however, we will have no time to consider. Finally, note that an empirical distribution is just another discrete distribution (albeit a random one), so other indexes, such as quantiles, median, mode, and so on, all apply (being discrete, quantiles are usually ambiguous, with different conventions adopted by different authors to resolve the, actually harmless, ambiguity).

4.3.1 The n − 1 story

Early in the 20th century, a quality control employee of the Guinness brewery in Dublin named Gosset developed a new statistical procedure to estimate the quality of the business's brew, to remarkable success. His procedure was based on a combination of the sample mean and the variance defined in (3). His work was greatly appreciated by one of the main founders of modern classical statistics, the biologist Ronald Fisher. Fisher, however, was also enthralled by a way of classifying random quantities constructed from samples that has useful traits in the Gaussian case, but not otherwise. Gaussian models are the most commonly used in classical statistics, so this is not necessarily an odd choice, but it has no bearing on the particular problem that Gosset addressed. Nonetheless, Fisher judged that the proper estimator for σ² was not (3), but the unbiased correction

s^2 = \frac{1}{n−1}\sum_{k=1}^n (x_k − \bar{x})^2    (4)

which is now commonly called the sample variance, with (3) oddly called the population variance. Further, Fisher reworked Gosset's method using (4) rather than (3). Since Fisher wielded (and still wields) enormous authority, this terminology and his reworked method became the standard, but you should be aware that the dominant use of s² instead of S² is purely a historical accident, with no greater significance.

One obvious fact is that S² < s² (the sum of squares is divided by n instead of n − 1), so that, if you are using these as bare estimators for σ², s² is more pessimistic. (Of course, you will also notice that, as soon as we are looking at samples that are not very small, the difference between dividing by n or by n − 1 is almost irrelevant, especially when data comes from experiments where the precision is limited.) However, this is a rough procedure, and as far as the more careful procedures we will look at are concerned, it makes absolutely no difference which one you use (except in tweaking formulas in minor formal details).
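In numerical libraries, the choice between (3) and (4) usually appears as a "degrees of freedom" switch; for instance, NumPy's var exposes it as ddof. A sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)                # fixed seed, so the run is repeatable
x = rng.normal(loc=10.0, scale=2.0, size=30)  # simulated sample; true sigma^2 = 4

S2 = np.var(x, ddof=0)  # divide by n     -- equation (3)
s2 = np.var(x, ddof=1)  # divide by n - 1 -- equation (4), the "sample variance"

n = len(x)
print(S2, s2)                 # s2 is slightly larger than S2
print(S2 - (n - 1) / n * s2)  # 0 (up to rounding): S^2 = (n-1)/n * s^2
```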
This has not prevented some books from trying to justify Fisher's choice with bogus arguments. One absolutely galling one (in a widely used introductory statistics textbook) states outrageously that it is always true that S² < σ². This is ridiculous, of course (what is worse, the "proof" of this statement is based on a small ad hoc example; I'll be happy to produce a similarly irrelevant example where S² > σ²). In fact, since both empirical variances are never negative but can, in principle, take large values, their distributions (which are practically identical) are not symmetric, and, generally, their median will be smaller than their expectation (this can be checked explicitly in the Gaussian case), so that both are more likely to underestimate the true variance than to overestimate it. (Another, generally very careful, textbook states erroneously that the fact that s² is unbiased means that half the time it will underestimate and half the time overestimate the true variance. Confusing the median with the expectation is quite a slip for an otherwise carefully written text.)

5 Other sampling methods

In our course, we will always assume that the observations constitute a simple random sample, but in real life other, formally more elaborate, methods are also in use. None of the techniques we will look at applies directly to samples obtained in these other ways, some of which are actually useless for any rigorous analysis. To be sure, generating a proper simple random sample can be very challenging, if not almost impossible, but alternative methods face as many, if not more, challenges.

5.1 Simple random sampling in practice

Different experiments may require different approaches. We look at two typical examples.

5.1.1 Physical measurement

This is a straightforward case, only needing care in setting up the experiment. For example, repeatedly measuring the content of a chemical in a body of water can be thought of as producing independent identically distributed results, if it is reasonable to think that there is no reason for the concentration to be significantly higher in one spot than another, and if we make sure that our instrument is reset every time, so that it has no memory of the previous measurement. Since the variations in measurements may be ascribed to many small effects, it would be reasonable to assume that the measurements come from a normal distribution, and we may be looking at determining one or both parameters (µ and σ²) of this distribution.
5.1.2 Polling

This is more complex, even if it is the situation you are most likely to have heard about. Taking a simplified model, assume you are wondering about the opinion of a population, let's say of the United States, about something. You cannot interview every single individual in America (not to mention that there are continual changes as individuals depart, die, are born, immigrate), so you choose a sample, that is, a limited number of individuals, trying to extrapolate the outcome of your inquiry to the whole population. In practice, serious polls will interview at least 500 to 1000 individuals. (The United States has a census every 10 years, when the Census Bureau tries to count every individual living in the country on a given day. This is, obviously, not a simple random sample, and it presents its own specific statistical problems. The general census does not go beyond counting people, but, in order to gain more detailed information, a limited number of households are asked to complete a long form, and this should be a random sample, or, possibly, a stratified sample, as described below.)

How could you choose the people to interview so that the simple random sample model could reasonably be applied? The standard model for this, taken from lotteries, is to think of the population as a huge collection of balls, marked 1 for "I like" and 0 for "I don't like". (Of course, most polls have more than two possible answers, but the extension is easy, only requiring a little more math.) You mix all the balls and take 500 or 1000 out of the 350 million+. If the extraction is done properly (say, the balls have been mixed thoroughly), it is reasonable to assume that if the proportion of 1s is p, then there is a probability p of picking a 1. The total number of balls extracted is a tiny proportion of the whole, so that, even if you make sure not to risk picking the same ball twice, the fact that on the second, third, ... extraction the proportion of 1s and 0s has slightly changed (because of the balls you already took away) is not measurable. This implies that you can look at your total result as a binomial experiment with parameters n (500, 1000, or whatever your sample size is) and p. The sample mean \bar{X} has expectation p and variance p(1−p)/n, but, more importantly, due to the CLT, it will be approximately normally distributed (provided that p was not too close to 0 or 1 to begin with). A simulation of this idealized model follows below.

What problems do you face in applying this model? Many, and growing. First, you do not have 350+ million balls to shuffle, so you need to find a method of choosing people at random, with randomness comparable to what you get by extracting balls from a well-shuffled urn. The traditional method relied on the fact that the vast majority of households were listed in telephone directories, and methods were developed to pick numbers at random from these directories. It was not a perfect method, but it proved to be good enough for most cases. Recently, the massive switch to cell phone service has made this method less and less reliable (in particular, since having no land line is especially frequent in the younger generations, sampling this way creates an unwanted bias towards the older population). This issue is very high on the minds of pollsters and is thought to be at least a factor behind some massive failings in forecasts in recent elections in many countries.

There are other issues with our model that are intrinsic, and not due to sociological trends. A big one is the fact that not everybody picked will answer: either they will not be home when called, or they just hang up right away. How to handle these no-shows requires more math, and many polling businesses will adopt their own special methods. A more subtle problem is, of course, that people will lie, especially if the question is deemed to be sensitive. That's something that requires even more creativity to be taken into account. In political polls, an extra issue is that, in countries like the US where many people do not vote at all, sampling the "likely to vote" segment is another delicate problem, as relying on self-description as a likely voter is not necessarily a reliable method (another variation on the lying problem).
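Here is the idealized urn model in simulation form (the "true" proportion and the sample size are arbitrary choices; in a real poll the proportion is exactly what we do not know):

```python
import math
import random

p_true = 0.42  # hypothetical population proportion; the unknown target in a real poll
n = 1000       # a typical serious-poll sample size, as mentioned above

# Idealized urn: each interview is an independent Bernoulli(p_true) draw
# (removing 1000 balls out of 350 million barely changes the proportions).
answers = [1 if random.random() < p_true else 0 for _ in range(n)]
p_hat = sum(answers) / n
se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p_hat

print(p_hat)  # near 0.42
print(se)     # about 0.016; by the CLT, p_hat is roughly N(p_true, se)
```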
5.2 Other methods

5.2.1 Stratified sampling

It may be more practical to divide the general sample into sub-samples by geography, social features, and so on. One reason may be practical: rather than extracting 1000 addresses out of 350 million, we might prefer to extract smaller numbers from more limited groups. Another concern is that sampling from the whole population may easily miss some groups (minorities, low income, special communities) altogether, even though they may be numerous enough and specific enough to affect the overall results. This is not so much done to make the sample more "representative" (picking representative, typical samples is pre-modern statistics), but rather as an attempt to reduce the uncertainty intrinsic in sampling by controlling it, splitting it between groups (this sampling is connected to variance-reduction methods). The difficulty in using this method is that, to merge the separate polls, you need information on how the various sub-groups combine into the population; that is, you need precise sociological data on the various sub-groups.

5.2.2 Systematic Sampling, often called "Every k-th"

In quality control on a production line, to check if the products live up to specs, one can start by choosing an integer k > 0, then pick a first item at random, and after that pick the k-th, 2k-th, ... items for inspection. This procedure will produce a reasonable candidate for a simple random sample only if some specific assumptions about the production line are satisfied. Of course, if the products in the line have a quality that can be thought of as independent and identically distributed, it doesn't matter how you pick items (you could just as well choose k = 1), but that would be quite a gutsy assumption. In general, to make this procedure work, you still have to assume a specific (even if fairly common) model for your line (in technical terms, it must look like a stationary random process with short-term correlations; we skip the rigorous definition). This can often be reasonable, but if the line exhibited, for example, a periodic creation of defects, this procedure would fail completely. In other words, its validity depends on a good understanding of how possible defects might enter the production process. (A sketch comparing the two picking schemes follows below.)

5.2.3 Block (Cluster) sampling

Again, in quality control, production items might be lumped in blocks, and one (or more) blocks chosen at random for quality testing. Once again, the validity of this procedure depends on strong assumptions on how these blocks are formed. Otherwise, this can easily become a special case of convenience sampling, as described momentarily.
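A sketch contrasting the two picking schemes on a fictitious production run; the systematic sample looks just as "random", but its validity rests on the stationarity assumptions discussed above:

```python
import random

items = list(range(1_000))  # stand-in for one day's production run

# Simple random sample of size 50.
srs = random.sample(items, 50)

# Systematic "every k-th" sample: a random start, then every k-th item.
k = 20
start = random.randrange(k)
systematic = items[start::k]

print(len(srs), len(systematic))  # 50 50
# The systematic sample is trustworthy only if the line has no structure
# (e.g., no periodic pattern of defects) tied to the spacing k.
```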
5.2.4 Convenience sampling

This is not even a statistical sampling method, and should not be listed together with the previous methods (but it is in most textbooks, so here it is). It consists in choosing a sample that is immediately available, instead of picking one at random. For example, you could walk to a mall and ask the people you come across there. Variations are self-selected samples, as in call-in or Internet polls, where the respondents volunteer their answers, which is a variation on convenience sampling. A famous example was the Literary Digest's prediction that Landon would win the 1936 presidential election over Roosevelt: the poll was the result of responses by the readers of a magazine, a typical self-selected sample, taken from a specific small minority of the population. Any statement based on these procedures can be dismissed outright, since it holds no more content than your personal opinion would. In general, convenience sampling will practically always (not simply often) produce biased, hence useless, results.

It should be noted that some standard procedures, especially in the social and medical sciences, may lead, in a subtle way, to what is essentially a convenience sample. For example, many social and psychological studies rely on samples constructed from the student population of the institution(s) involved in the experiment. The fact that many of these experiments could not be reproduced has been ascribed (among other explanations) to different student populations, often from different countries, having significantly different responses to the circumstances of the experiment. Similarly, in medical trials, the individuals are volunteers, hence self-selected. While efforts are made to limit the bias that this may produce, there is always the possibility that failure to replicate the results of a trial may be due to the convenience-sampling risk that is implicit in these methods.