Some Topics for 146

1 Descriptive Statistics

1.1 Recapping formulas

For an experiment resulting in $n$ data points $x_1, x_2, \ldots, x_n$, we may define

• The mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• The population variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$

• The population standard deviation: $\sigma = \sqrt{\sigma^2}$

• The sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

• The sample standard deviation: $s = \sqrt{s^2}$

• The range: $R = \max\{x_i\} - \min\{x_i\}$

• The midrange: $\frac{\min\{x_i\} + \max\{x_i\}}{2} = \min\{x_i\} + \frac{R}{2} = \max\{x_i\} - \frac{R}{2}$

• The quantiles: $Q_\alpha$ is a value such that a fraction $\alpha$ (that is, $100\alpha\%$) of the data is less than $Q_\alpha$. As special cases we have: $\alpha = 0.5$, the median; $\alpha = 0.25, 0.75$, the 1st and 3rd quartiles; $\alpha = 0.01, 0.02, \ldots, 0.99$, the percentiles.

• The average deviation, which is best defined in terms of the median $m = Q_{0.5}$:
$$d = \frac{1}{n} \sum_{i=1}^{n} |x_i - m|$$
although you will also find it calculated with $\bar{x}$ in place of $m$ in this formula.

1.2 Formulas for the variance

Let's go over this algebra point again. The numerator in the variance formula is
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (1)$$
where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. Expanding the square (recall that $(a-b)^2 = a^2 - 2ab + b^2$), this is equal to
$$\sum_{i=1}^{n} \left( x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right) = \sum_{i=1}^{n} x_i^2 - 2 \bar{x} \sum_{i=1}^{n} x_i + n \bar{x}^2$$
(since $\bar{x}$ is a fixed number now). We also have that
$$\sum_{i=1}^{n} x_i = n \bar{x}$$
(from the formula for $\bar{x}$), so the expression is equal to
$$\sum_{i=1}^{n} x_i^2 - 2 n \bar{x}^2 + n \bar{x}^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2$$
If we use the formula for the population variance (divide the sum (1) by $n$), we find
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2 \qquad (2)$$
(as already noted previously, around the first test). If we use the formula for the sample variance (divide by $n - 1$), we have
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} x_i^2 - \frac{n}{n-1} \bar{x}^2 = \frac{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}{n-1} \qquad (3)$$
As you may notice, using either (2) or (3), depending on our choices, we may compute the variance (and, correspondingly, the standard deviation) using only two pieces of summary data: the sum of the squares of the data and the sum of the data. There is no need to keep track of each individual datum.
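To see formulas (2) and (3) at work, here is a minimal Python sketch (not part of the original notes; the function name `summary_stats` is just illustrative) that computes the mean and both variances from two running totals, exactly as the remark above suggests:

```python
import math

def summary_stats(data):
    """Mean, population variance, and sample variance from summary data.

    Only the count, the sum of the data, and the sum of the squares are
    kept, implementing formulas (2) and (3); no individual datum is stored.
    """
    n = 0
    total = 0.0      # sum of the data
    total_sq = 0.0   # sum of the squares of the data
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    pop_var = total_sq / n - mean ** 2                  # formula (2)
    sample_var = (total_sq - n * mean ** 2) / (n - 1)   # formula (3)
    return mean, pop_var, sample_var

mean, pop_var, sample_var = summary_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(mean)                          # 5.0
print(pop_var, math.sqrt(pop_var))   # 4.0 2.0
print(sample_var)                    # 32/7, approximately 4.571
```

One caveat: for data with a large mean and small spread, the subtraction in (2) and (3) can lose precision in floating point, so the two-pass definition from section 1.1 is numerically safer.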
1.3 Which measures should we worry about?

We encountered several measures. For example:

• measures of central tendency: mean, median, midrange

• measures of dispersion: variance (in two flavors)/standard deviation, average deviation, Inter-Quartile Difference, range, 1st and 3rd quartiles, minimum and maximum

While all are used in traditional descriptive statistics (advanced descriptive statistics goes into more sophisticated territory, using elaborate computer-based graphing and representation techniques), most are not suited for inferential use. This term refers to the use of mathematical models for the observation process that allow quantitative evaluations of the estimates we make from the data. The main measures used in this latter setting are

• the mean

• the variance/standard deviation

These measures are sufficient when the most common mathematical model for our observation is appropriate. Sometimes this is not the case (even though you may find it applied nonetheless: it is so easy to use that it gets abused more than you would expect). In such cases we may want to refer to percentiles (as many as practical) as a more flexible tool. In the most general case, the full set of observations may be needed to reach a reliable result.

2 Probability Formulas

2.1 The basics

We assume that we are given

• a sample space $S$

• all (or many) subsets of $S$, called events: $A \subseteq S$. We may operate on events using unions ($A \cup B$), intersections ($A \cap B$), and complements ($A^c = S - A$: the set of all elements in $S$ that are not in $A$)

• a probability: to each event $A$ we associate a number $P[A]$, with the following properties:
$$0 \le P[A] \le 1$$
$$P[S] = 1$$
$$P[A^c] = 1 - P[A]$$
$$P[A \cup B] = P[A] + P[B] - P[A \cap B]$$

From these assumptions we can derive many interesting consequences. To do this, we have to add a few additional notions that build on this setup.

2.2 Conditional probabilities

We define the following quantity (the probability of an event $A$, conditioned on an event $B$):
$$P[A \mid B] = \frac{P[A \cap B]}{P[B]} \qquad (4)$$
This is interpreted as the adjusted probability of $A$, if we know (or assume) that $B$ has happened (is true). As an example, in the simplest model of the toss of a die, if we take

• $S = \{1, 2, 3, 4, 5, 6\}$

• $P[\{1\}] = P[\{2\}] = P[\{3\}] = \ldots = P[\{6\}] = \frac{1}{6}$

• $A = \{5\}$, $B = \{1, 3, 5\}$ ($A$ is the event "the die landed with face 5 up"; $B$ is the event "the die landed with an odd face up"),

we have
$$P[A] = \frac{1}{6}$$
$$P[B] = P[\{1\}] + P[\{3\}] + P[\{5\}] = \frac{3}{6} = \frac{1}{2}$$
$$A \cap B = A$$
$$P[A \mid B] = \frac{P[A]}{P[B]} = \frac{1/6}{1/2} = \frac{2}{6} = \frac{1}{3}$$
In this example, if we know that an odd face has appeared, the probability that it may be 5 increases from $\frac{1}{6}$ to $\frac{1}{3}$.

Formula (4) can be rewritten as
$$P[A \cap B] = P[A \mid B] \, P[B]$$
Since it also follows that
$$P[B \mid A] = \frac{P[A \cap B]}{P[A]}$$
so that
$$P[A \cap B] = P[B \mid A] \, P[A]$$
we conclude that
$$P[A \mid B] \, P[B] = P[A \cap B] = P[B \mid A] \, P[A]$$
One way of using this formula is to observe that it implies
$$P[A \mid B] = \frac{P[B \mid A] \, P[A]}{P[B]}$$
(Bayes' Formula). This is a very useful tool (as an example, see the relevant problem in the assignment due July 27).
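As a quick numerical check of formula (4) and of Bayes' Formula, here is a short Python sketch (the code and names are mine, not from the notes) that redoes the die example by summing probabilities over the sample space:

```python
from fractions import Fraction

# Fair die: sample space S, each outcome with probability 1/6.
S = {1, 2, 3, 4, 5, 6}
P = {s: Fraction(1, 6) for s in S}

def prob(event):
    """P[event]: add up the probabilities of the outcomes in the event."""
    return sum(P[s] for s in event)

def cond(A, B):
    """P[A|B] = P[A ∩ B] / P[B], formula (4)."""
    return prob(A & B) / prob(B)

A = {5}        # face 5 up
B = {1, 3, 5}  # odd face up

print(prob(A), prob(B))   # 1/6 1/2
print(cond(A, B))         # 1/3, matching the computation above

# Bayes' Formula: P[A|B] = P[B|A] P[A] / P[B]
print(cond(B, A) * prob(A) / prob(B))  # 1/3 again
```

Using `Fraction` keeps the arithmetic exact, so the output matches the hand computation rather than a decimal approximation.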
2.3 Independence

If it happens that $P[A \mid B] = P[A]$, then $P[A \cap B] = P[A \mid B] \, P[B] = P[A] \, P[B]$, which, by the symmetry of this formula, implies that also $P[B \mid A] = P[B]$. In this case we say that the two events, $A$ and $B$, are independent. Since, as already noted, in this case
$$P[A \cap B] = P[A] \, P[B]$$
many calculations are greatly simplified when dealing with independent events. However, this is an assumption that has to be examined critically when it is applied to a real situation.

2.4 Random variables

In practice, we deal very rarely with single events. What we mostly deal with are functions defined on the sample space. For example, the die model discussed in section 2.2 can be formulated as follows. Let $S$ be a space where each point represents a possible outcome of the toss of a die. Here, you can consider any number of features as part of the outcome. We now define a function $X : S \to \{1, 2, 3, 4, 5, 6\}$, where $X = k$ for all outcomes where the die ends up with the face with $k$ points up. This is a more flexible framework, as it allows us, at least in principle, to analyze the experiment "toss of a die" in much greater detail, if we felt it necessary. Now, our event $A$ is defined as $\{X = 5\}$. Similarly, $B = \{X = 1, 3, 5\}$.

The collection of probabilities $P[X = x]$, where $x$ runs over all possible values of $X$, is called the (probability) distribution of $X$. If the possible values of $X$ are too many to be accounted for individually, we resort to keeping track of probabilities of events like $\{a \le X \le b\}$, $\{X > c\}$, $\{X < d\}$, and so on. The full distribution of a random variable can be a pretty complicated object, as soon as the possible values of the variable are a large number. We introduce then several summaries of distributions, in analogy with the measures we introduced in descriptive statistics. We will make a strong connection between these two sets of summaries/measures momentarily. We define

• The mean, or expected value, of $X$ as $EX = \mu_X = \sum_x x \, P[X = x]$ (this is the weighted average of all values $x$, with weight equal to the probability of each value)

• The $k$-th centered moment of $X$ as $m_k = E(X - EX)^k = \sum_x (x - EX)^k \, P[X = x]$

• The $k$-th absolute moment of $X$ as $EX^k = \sum_x x^k \, P[X = x]$

• In particular, for $k = 2$, we define the variance of $X$ as $m_2 = \sigma_X^2 = \sum_x (x - EX)^2 \, P[X = x]$

• We can also define quantiles $Q_\alpha$: they are numbers such that $P[X \le Q_\alpha] = \alpha$

Moments are far from exhaustive information on $X$ (at least when we know only a few of them), but they are usually much easier to handle than the full distribution.

2.5 Conditional expectation

If we have two random variables, $X$ and $Y$, with their respective distributions, it may be useful to consider their connections, and this is done through the introduction of conditional distributions and, consequently, of conditional moments (especially conditional means and conditional variances). In fact, we can easily define
$$P[X = x \mid Y = y] = \frac{P[\{X = x\} \cap \{Y = y\}]}{P[\{Y = y\}]}$$
Once we have this, we can also define, for example,
$$E[X \mid Y = y] = \sum_x x \, P[X = x \mid Y = y]$$
(note how this turns out to be a function of $y$). These tools turn out to be extremely versatile in exploring probability models, but we will avoid delving too deeply in this direction.

2.6 Independence of random variables

One item that we want to explore in more detail is independence, when it comes to random variables. Since a random variable is characterized by the totality of its events $\{X = x\}$, rather than by one only, it is natural to define: two random variables $X$ and $Y$ are said to be independent if
$$P[X = x \mid Y = y] = P[X = x]$$
for all $x$ and all $y$. It turns out that this is the useful notion of independence, rather than the notion of independence of single events. The assignment due July 27 has a problem where this is illustrated clearly: studying the toss of two dice, by a freaky coincidence, we find that two events that are clearly connected happen to be independent. However, if instead of looking at these two events in isolation, we consider the two random variables that define them, these two (again, clearly connected) turn out to be very far from independent.

3 Connecting with statistics

A first connection between our basic problem (dealing with observations) and probability is the following observation. Suppose we observe some quantity repeatedly (e.g., we make several measurements of a physical quantity). Say we observe the values $x_1, x_2, \ldots, x_n$. We can construct a probability model for this experiment that says nothing more than the data itself, but may suggest a way to connect to a more general model. We assume there is some sample space $S$, and a random variable $X$ on $S$, which takes values $x_1, x_2, \ldots, x_n$. Since we have no reason to consider any of these values more significant than any other, we assume that the distribution of $X$ is given by
$$P[X = x_k] = \frac{1}{n}$$
for all $k = 1, 2, \ldots, n$. This distribution is called an empirical distribution. We now notice that, for example (the sketch below verifies the first two points numerically),

• the mean of the data is $\bar{x} = EX$

• the population variance of the data is equal to $\sigma_X^2$

• the quantiles $Q_\alpha$ of the data are the quantiles of the distribution of $X$
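A small Python sketch of this correspondence (illustrative code, not from the notes): build the empirical distribution of a data set and check that the moments of $X$ reproduce the descriptive measures of section 1.

```python
from collections import Counter

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Empirical distribution: P[X = x] = (multiplicity of x) / n.
dist = {x: c / n for x, c in Counter(data).items()}

# Moments of X under the empirical distribution.
EX = sum(x * p for x, p in dist.items())
VarX = sum((x - EX) ** 2 * p for x, p in dist.items())

# Descriptive statistics of the raw data.
mean = sum(data) / n
pop_var = sum((x - mean) ** 2 for x in data) / n

print(EX, mean)       # 5.0 5.0
print(VarX, pop_var)  # 4.0 4.0
```

Repeated values are merged into a single point of the distribution, with probability equal to their multiplicity divided by $n$; the moments come out the same either way.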
In other words, each experiment can be seen as a probability model. The delicate issue is that, should we repeat the experiment a second time, the model will most likely be different. The deeper way to connect these two topics is now at hand.

3.1 Inferential statistics

We look at a statistical experiment as follows. We assume there is some sample space $S$, and some random variable $X$ modeling the quantity we are observing. $X$ has a distribution, and we want to get information on this distribution. To this end, we make a number of repeated observations of $X$. Hopefully, these repeated experiments are all identical and independent (if not, there are ways to handle the situation, but the calculations become much more complex). This is the same as saying that we have $n$ independent random variables $X_1, X_2, X_3, \ldots, X_n$, with their distributions, which are all the same (such a collection is called an i.i.d. collection, for independent, identically distributed). We actually observe $n$ values, $x_1, \ldots, x_n$. If we knew the distribution of $X$, and hence of the $n$ i.i.d. variables $X_i$, we could calculate the probability $P[X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n]$, but the problem is that this distribution is precisely what we are trying to discover!

To solve this paradox, we take a reasonable attitude: we decide to rely on the fact that events that have small probability are not likely to occur; if something happened, it should have a reasonably large probability. Of course, such a statement is not terribly strong: except for impossible events, anything can happen. However, most of the time, we can assume that what happens has a reasonably large probability of happening, and that small probability events are very rare. If so, we will be wrong only rarely...

All the above may be reasonable, but it is only a declaration of principle. We need to translate this into a practical strategy. This is what we will start doing in the next chapter. In the meantime, we quote two impressively general results, which will be of great help in building our practical strategy. Their proofs vary from pretty easy to somewhat technical, but we will skip both. If you are curious, a separate file will provide some info.

3.2 Limit theorems

The basic strategy in statistics is that more observations mean better results. To this end it is useful to know two mathematical theorems that tackle the issue of what happens when you have very many observations.

3.2.1 The Law of Large Numbers (LLN)

This is not to be confused with the empirical "law" of large numbers that we discussed in class. The latter is an empirical observation that seems to apply to some well-organized circumstances. The law we are quoting here is an abstract mathematical theorem. In a slightly less general form than technically possible, it states:

Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of i.i.d. random variables, with mean $\mu$. Then, for any $\delta > 0$ and $\varepsilon > 0$,
$$P\left[ \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \right| > \delta \right] < \varepsilon$$
provided we choose $n$ large enough.

What this means, practically, is that if we make a very large number of independent, identically distributed observations, the arithmetic mean of our results is very close to the theoretical expected value of the random variable we are studying. In other words, our $\bar{x}$ is going to be close to the theoretical value $EX = \mu$, if we make enough observations. This is a big deal, with the only downside that there is no practical indication of how large $n$ must be for us to be sure (or, at least, confident) that $\bar{x}$ is close enough to $\mu$.
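The LLN is easy to watch in action. Below is a small simulation sketch (a fair die, so $\mu = 3.5$; the setup is illustrative, not from the notes): as $n$ grows, the running mean of the i.i.d. rolls drifts toward $\mu$.

```python
import random

random.seed(0)   # fixed seed so the run is reproducible
mu = 3.5         # EX for one fair die roll: (1 + 2 + ... + 6) / 6

rolls = []
for n in (10, 100, 1_000, 10_000, 100_000):
    while len(rolls) < n:                  # extend the same i.i.d. sequence
        rolls.append(random.randint(1, 6))
    x_bar = sum(rolls) / n
    print(f"n = {n:6d}  x_bar = {x_bar:.4f}  |x_bar - mu| = {abs(x_bar - mu):.4f}")
```

Any single run can misbehave (the theorem only bounds the probability of a large deviation), but a typical run shows the gap shrinking roughly on the $1/\sqrt{n}$ scale, which is exactly the scale the next theorem makes precise. To help somewhat, here is that next major theorem.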
3.2.2 The Central Limit Theorem

("Central" here refers to the fact that this theorem is central in statistical applications, as we will see; there is no other center it is referring to.)

Suppose we have, again, a sequence of i.i.d. random variables, with mean $\mu$ and variance $\sigma^2$. Then
$$P\left[ a \le \frac{1}{\sqrt{n \sigma^2}} \sum_{i=1}^{n} (X_i - \mu) \le b \right] \approx \Phi(b) - \Phi(a) \qquad (5)$$
where $\Phi(x)$ is the function that returns the area under the curve $\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$ (this is the famous bell curve, or Gaussian curve), up to horizontal coordinate $x$. This is not a function that is easily computed by hand; hence it is pre-programmed in any statistical software package (as well as in any spreadsheet), and is also tabulated in any statistics or probability book. The $\approx$ is meant in the sense that the difference between the two sides of (5) becomes smaller and smaller as $n$ grows.

With a little reflection, we may notice that this result does help in evaluating how far we have to go so that $\bar{x}$ is close to $\mu$. What is frustrating is that we are just shifting the problem by one notch, as there is no help in learning how far we have to go so that the two sides of (5) are as close as we would like. There are further results that help even with this, but we'll leave the topic for our future probability class (if any).

3.2.3 Wrapping Up

It would be interesting to dig a little around these two theorems and learn more about what they really mean, but, for the time being, let us limit ourselves to these two vague statements, which we will be relying on a lot.

1. The LLN states, intuitively, that $\bar{x} \approx \mu$, at least if we have enough data.

2. The CLT can be rewritten (with just a little algebra) as
$$P\left[ a \le \frac{\frac{1}{n} \sum_{i=1}^{n} X_i - \mu}{\sigma / \sqrt{n}} \le b \right] \approx \Phi(b) - \Phi(a)$$

A random variable such that the relation above holds exactly is called Gaussian, or Normal. Hence, the CLT is saying that (assuming we knew $\mu$ and $\sigma$), (essentially) regardless of the distribution of $X$, the modified mean, calculated as
$$\frac{\frac{1}{n} \sum_{i=1}^{n} X_i - \mu}{\sigma / \sqrt{n}}$$
will be (at least approximately) a Gaussian random variable.

Remark/Exercise: See if you can figure this out: given $n$ i.i.d. random variables with mean $\mu$ and variance $\sigma^2$, the mean
$$\frac{1}{n} \sum_{i=1}^{n} X_i$$
has mean $\mu$ and variance $\frac{\sigma^2}{n}$. Now, look at the previous version of the CLT, and notice the connection.
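One way to check both the Remark and the CLT empirically is a simulation along these lines (a sketch; the choice of die rolls, $n = 50$, and 20,000 trials is arbitrary). For a fair die, $\mu = 3.5$ and $\sigma^2 = 35/12$, so the mean of $n = 50$ rolls should have variance $\sigma^2/50 \approx 0.058$, and its standardized version should land in $[-1, 1]$ about $\Phi(1) - \Phi(-1) \approx 68.3\%$ of the time:

```python
import random
import statistics

random.seed(1)
mu, sigma2 = 3.5, 35 / 12   # mean and variance of one fair die roll
n, trials = 50, 20_000      # observations per experiment; repeated experiments

means = [statistics.fmean(random.randint(1, 6) for _ in range(n))
         for _ in range(trials)]

# Remark: the mean of n i.i.d. variables has mean mu and variance sigma^2 / n.
print(statistics.fmean(means))       # close to 3.5
print(statistics.pvariance(means))   # close to (35/12)/50, about 0.0583

# CLT: the standardized mean is approximately Gaussian.
z = [(m - mu) / (sigma2 / n) ** 0.5 for m in means]
print(sum(-1 <= v <= 1 for v in z) / trials)  # close to 0.683
```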