Download Module 6 - Wharton Statistics

STAT 101, Module 6: Probability, Expected Values, Variance (Book: chapter 4) Reasons for wanting probability statements  Quantify uncertainty: “For the manager the probability of a future event presents a level of knowledge” (p. 79).  Quantification makes uncertainty “predictable” in that probabilities describe what happens in large numbers. For example, a company can figure an assumption of 5.5% faulty shipments from its supplier into its operations planning.  Many statements about “rates” are hidden probability statements: “The take-rate of our services from a telemarketing campaign is 2.1%.” This means that in the past and presumably in the future about 21 out of 1000 targeted households sign up for the offered service. We can say the estimated probability of taking the service after targeting by telemarketing is 0.021.  Quantification with probabilities can give insight. Some examples found by googling “probability of successful corporate take-over”: o “The probability of merger success is inversely related to analyst coverage at the time of announcement.” o “The probability of takeover success rose to 71.3% when a bidder initially held the shares of the target firm.” Probability as an idealization  Probability is an idealization. It is to statistics what straight lines are to geometry, point masses to Newtonian mechanics, point charges to electrodynamics,…  The idealization comes about for relative frequencies when the number of cases goes to infinity. The limit is the probability. (Book: p.90)  Probabilities do not “exist”, just like straight lines don’t “exist”. Relative frequencies and taught ropes exist, but probabilities and straight lines do not “exist”. Both are constructed by a thought process called “idealization”.  Why do we need idealizations? Idealizations simplify our thinking and focus on aspects of reality that are essential in a given context. o In geometry we don’t want to talk about the thickness of a line. Thickness matters for taught ropes, but when we do geometry we “idealize away” thickness by thinking of a line as “infinitely thin”. o In statistics we need probability to have a target that is estimated by relative frequencies. We can think of the probability as the relative frequency if we had an infinite amount of data. Like geometry does away with thickness of taught ropes, probability does away with limitations of finite data samples. Foundations of frequentist probability  Probabilities are numbers between 0 and 1 (inclusive) assigned to     outcomes of random events. A random event can have one of two outcomes, which we may characterize as “yes” or “no”. Examples of random events: o head on a coin toss, o 6 on the rolling of a die, o ticket 806400058584 is drawn in blind drawing from a welltumbled box of lottery tickets, o any trader committing insider trading in 2007, o any targeted household signing up for my firm’s services, o a randomly caught bear is tagged and found next season again, o GM stock gains by more than 5% in a given week, o the S&P loses more than 3% in a given day,… The term “random” does not mean haphazard; it means “repeatable in principle, but with unpredictable outcome”. The term “unpredictable” is relative to what we know; if we know more, less is unpredictable, and vice versa. For a random trader on Wallstreet, we would not know whether he/she would do insider trading; if the trader is my dad or mom, I might be certain that he/she wouldn’t. Hence for a “random trader” we might estimate the probability of insider trading to be 1%; for mom or dad there is no probability because there is no repeatability, and I know.  It should be possible to “independently repeat” the random event. Independence means knowing the outcome of the first realization of the event should not help in guessing the outcome of a second realization of the event. Independence of realizations of a random event is sometimes difficult to justify. For example, if it rains today, chances are probably elevated that it will rain tomorrow, too. Hence there is information in the outcome of the first realization for the outcome of the second realization.  A counter-example: the event that a whale jumps on my yacht. Why is this not a random event? It only defines the ‘yes’ side of an event but not the ‘no’ side. To actually have an event in our sense, one has to be able to determine when it did not occurred. The ‘no’ side of an event is needed so we have a notion of total number of trials/observations/measurements. To make “a whale jumps on my yacht” a well-specified event, one would have to modify it to something like “the event that a whale jumps on my yacht on a given day”, or “the event that a whale jumps on a given yacht throughout its lifetime”, or “the event that a whale jumps on any yacht anywhere on a given day”, or “the event that a whale jumps on a given yacht on a given day”. These specifications define the repeatability as across days, or yachts, or combinations of days and yachts. Day-to-day repetitions may be justified as independent if one thinks that next days jumping of a whale on a yacht is not linked to today’s behaviors of whales and yachts. A similar thought applies to events such as 5% daily gains of GM stock and 3% daily losses of the S&P index. Is it so clear that tomorrow’s likelihood of these events is unrelated to today’s movements of GM stock or the S&P index?  The preceding thoughts show that there exist deep problems in applying probability. There exist more free-wheeling notions of probabiliy, called “subjective probability”. It formalizes strength of belief and can be founded on betting: how much money one is willing to bet on an outcome implies a belief in a certain probability of the outcome. Our notion of probabiliy, however, is “frequentist”.  Definition: Probability is the limit of relative frequencies of “yes” outcomes when the number of observed outcomes goes to infinity. Recall the definition of relative frequency: rel.freq.(yes) = #yes / (#yes + #no) Do we ever know probabilities?  No we don’t. But for thought experiments we often assume them.  Example: Flipping a coin; we assume P(head) = P(tail) = ½. This is actually taken as the definition or model of a fair coin. Similarly, we’d take P(any particular card) = 1/52 as the definition of a well-shuffled deck. Finally, P(any of 1…6) = 1/6 is the definition of a fair die.  For a finite sequence of independent repetitions of a random event, we can estimate the probability by the relative frequency. Later we will learn how to estimate the precision of this estimate with so-called “confidence intervals”. PS: In past long-run experiments with flipping coins one has never seen relative frequencies as close to ½ as they should be. We can say that fair coins do not exist. Yet the deviations from fairness are so small that they don’t matter. Notations, definitions, and properties of probabilities  A sample space is the set of all possible outcomes of a random experiment. Notation: Ω  A random event A is a subset of the set Ω of possible outcomes.  The set of outcomes that are not in A are denoted AC. Note that this is also a random event.  Events A and B are said to be disjoint if A∩ B = Ø. That is, events A and B cannot occur simultaneously.  The probability of a random event A is written P(A).  Probabilities must satisfy the following axioms: o P(A) ≥ 0 o P(Ω) = 1 o P(A or B) = P(A) + P(B) if A and B are disjoint.  Illustrations, using the example S&P losses and gains: o P(daily loss ≥ 3%) ≥ 0 o P(daily loss or gain or same) = 1 o P(daily loss ≥ 3% or daily gain ≥ 3%) = P(daily loss ≥ 3%) + P(daily gain ≥ 3%)  The axioms are direct consequences of the fact that probabilities are limits of relative frequencies. The same properties hold for observed relative frequencies: o #(daily market losses ≥ 3% in last 1000 trading days)/1000 ≥ 0 o #(daily losses or gains or same in…)/1000 = 1 o #(daily losses ≥ 3%... or daily gains ≥ 3% in...)/1000 = #(daily losses ≥ 3% in...)/1000 + #(daily gains ≥ 3% in...)/1000 Properties of probabilities (Book: Sec. 4.3)  Again, note that all of the following properties also hold for relative frequencies.  Complement rule: P(AC) = 1 – P(A) Proof: A and AC are disjoint, hence P(A)+P(AC) = P(Ω) = 1 Illustration: P(daily loss ≥ 3%) = 1 – P(daily loss < 3% or a daily gain or same)  Monotonicity rule: A contained in B => P(A) ≤ P(B) Proof: B = A or (B w/o A), where A and (B w/o A) are disjoint. Hence P(B) = P(A) + P(B w/o A) ≥ P(A) because P(B w/o A) ≥ 0. Illustration: P(daily loss ≥ 5%) ≤ P(daily loss ≥ 3%)  General addition rule: For any two random events A and B, not necessarily disjoint, we have P(A or B) = P(A) + P(B) – P(A and B) Important: The conjunction “or” is always used in the inclusive sense, meaning “A or B or both”, not “either A or B but not both”. In the inclusive sense, “or” expresses set-theoretic union. Proof: The intersection C = (A and B) is double-counted by P(A)+P(B). Draw a Venn-diagram to make this clear. For a formal proof, write: A = (A w/o B) or (A and B) B = (B w/o A) or (A and B) (A or B) = (A w/o B) or (A and B) or (B w/o A) and note that the events on the right hand sides are all disjoint, hence their probabilities add up. Illustration: P(earthquake or hurricane) = P(earthquake) + P(hurricane) – P(earthquake and hurricane)  The summation axiom can be generalized to more than two disjoint events. Here is the generalization to three: P(A1 or A2 or A3) = P(A1) + P(A2) + P(A3) if A1, A2, A3 are pairwise disjoint. Example: A1 = (S&P drops between 0 and 3% today) A2 = (S&P stays same today) A3 = (S&P rises between 0 and 3% today)  Partitioning rule: If B1, B2, B3 partition Ω in the sense that B1 or B2 or B3 = Ω and (B1 and B2) = (B1 and B3) = (B2 and B3) = Ø , and if A is an arbitrary event, then P(A) = P(A and B1) + P(A and B2) + P(B and B3) This formula generalizes of course to partitions of Ω consisting of more than three events. Example: B1 = market drop, B2 = market unchanged, B3 = market rise A = (GM rises by more than 1%) Conditional probability (Book: p. 99, Sec. 4.4)  Start with relative frequencies and pictures, a familiar mosaic plot: SURVIVED 1.00 ye s 0.75 0.50 no 0.25 0.00 1s t 2nd 3r d cr ew CLASS The skinny spine on the right shows in blue the overall relative frequency of survival, across classes. The spines of the mosaic, however, show the relative frequencies of survival conditional on being in a given class. Because concepts and properties of relative frequencies carry over to their limits for infinitely many cases (Titanic with infinitely many passengers, imagine!), the concept of conditional relative frequency carries over to the concept of conditional probability. P(A|B) := P(A and B) / P(B) That is, restrict the event A (“survival”) to the subset B (“1st Class”) by forming (A and B), and renormalize the probability to P(A and B)/P(B), such that the event B has conditional probability 1: P(B | B) = 1.  Example: What is the conditional probability of an outcome of A = {1} conditional on the outcomes B = {1,2,3}, assuming a fair die?  Example: What is the conditional probability of an outcome of A = {Ace} conditional on the outcome B = {spade}, assuming a wellshuffled deck?  Example: What is the conditional probability of a daily market drop of over 10% given that the market drop is over 3%? (There are no numbers given here; this is only a small conceptual point about two events where one implies the other.) Ramifications of conditional probability (Book: Sec. 4.4, 4.5)  Sometimes the conditional probability P(A|B) is given, and so is P(B). We can then calculate P(A and B): P(A and B) = P(A|B) · P(B) Example: Assume if it rains today it rains tomorrow with probability 0.6. The probability of rain is 0.1 on any day. What is the probability that it rains on any two consecutive days?  In other applications, the conditional probability P(A|B) is given, and so is P(A and B). Then one can calculate P(B): P(B) = P(A and B) / P(A|B) Example: If you know a plausible example, let me know…  A famous application of conditional probability is Bayes’ rule. In its simplest case, it allows us to infer P(B|A) from P(A|B), given also P(A) and P(B). (Book: Sec. 4.5) P(A|B) = P(B|A) · P(A) / P(B) Example: The conditional probability of getting disease Y given gene X is 0.8. The probability of someone having gene X is 0.0001, and the probability of having disease Y is 0.04. What is the conditional probability of having gene X given one has disease Y? Instead of applying the above formula, it might be easier to calculate P(B|A) to P(A and B) to P(A|B) in steps: P(gene X) = 0.0001 P(disease Y | gene X) = 0.8 P(gene X and disease Y) = 0.8 · 0.0001 = 0.00008 P(gene X | disease Y) = P(gene X and disease Y) / P(disease Y) = 0.00008 / 0.04 = 8/4000 = 0.002 Are you surprised? It seemed that if the gene causes the disease with such high probability, having the disease should be a pretty good indicator for the gene? What’s the hook? Exercise: Change the numbers in the example. What happens if the probability of having disease Y is 0.0001? What if it were 0.00001?  In the above Titanic example, one should think that it would be possible to calculate P(survival) from the mosaic plot, that is, from the conditional probabilities P(survival|class). That’s only true actually if one is given the probabilities of the classes, P(class). Then it works as follows (strictly speaking for relative frequencies, not probabilities): P(survival) = P(survival | 1st class) · P( 1st class) + P(survival | 2nd class) · P( 2nd class) + P(survival | 3rd class) · P( 3rd class) + P(survival | crew) · P( crew) Proof: The four terms on the right side equal = P(survival and 1st class) + P(survival and 2nd class) + P(survival and 3rd class) + P(survival and crew) These four events are disjoint and their union is simply “survival”. QED Numerically, this works out as follows: 0.3230 = 0.6246·0.1477 + 0.4140·0.1295 + 0.2521·0.3208 + 0.2395·0.4021 The numbers can be gleaned from the table below the mosaic plot in JMP. The above example can be restated in general terms as follows: P(A) = P(A|B1) · P(B1) + P(A|B2) · P(B2) + P(A|B3) · P(B3) + P(A|B4) · P(B4) For two conditioning events, which you illustrate with B = (1st class) and BC = (not 1st class), and A = (survival), this simplifies as follows: P(A) = P(A|B) · P(B) + P(A|BC) · P(BC) These are called marginalization formulas because they turn conditional probabilities of A into the marginal (= plain) probability of A. Here is an example from the Titanic with conditioning on sex: SURVIVED 1.00 ye s 0.75 0.50 no 0.25 0.00 fe m ale m ale SEX SEX By SURVIVED Count no Total % Col % Row % female 126 5.72 8.46 26.81 male 1364 61.97 91.54 78.80 1490 67.70 y e s 344 15.63 48.38 73.19 367 16.67 51.62 21.20 711 32.30 470 21.35 1731 78.65 2201 It takes some effort to decode the table on the right, but the four terms in the top left cell give the necessary clues: P(survival) = 0.3230 = P(survived|female) • P(female) + P(survived|male) • P(male) = 0.7319 • 0.2135 + 0.2120 • 0.7865  Here is a hairy but typical legal application that involves both Bayes’ rule and the probability summation of the preceding bullet. It shows how methods of detection with very small error probabilities can fail to be conclusive (from http://en.wikipedia.org/wiki/Bayes_rule): Assume a company is testing its employees for drugs, and assume the test is 99% accurate; also assume half a percent of employees take drugs. What is the probability of taking drugs given a positive test result? A first problem is to make sense of the term “99% accurate”. We take it to mean this: P(pos | drug) = 0.99 P(neg | no drug) = 0.99 true positives true negatives We infer the following, which is useful later: P(pos | no drug) = 0.01 P(neg | drug) = 0.01 false positives false negatives Then we know this: P(drug) = 0.005 P(no drug) = 0.995 marginal prob of drug use marginal prob of no drug use The question is this: P(drug | pos) = ? prob of being correct given a positive Solution: P(drug | pos) = P(drug and pos) / P(pos) The numerator is easy: P(drug and pos) = P(pos | drug) · P(drug) = 0.99 · 0.005 = 0.00495 The denominator is messy and requires collecting the pieces with the marginalization formula: P(pos) = P(pos | drug) · P(drug) + P(pos | no drug) · P(no drug) = 0.99 · 0.005 + 0.01 · 0.995 = 0.0149 marginal prob of a positive Finally: P(drug | pos) = P(drug and pos) / P(pos) = 0.00495 / 0.0149 = 0.3322148 Now that’s a disappointment! The probability of catching a drug user with a positive drug test is only 1 in 3!!! Two thirds of the positives are lawabiding citizens!!! Intuitively, what is going on? The jist is this: In order to catch a small minority of half a percent (the drug users), the detection method must have an error rate of much less than half a percent. The rate of false positives in the example is one percent, hence out of the majority of 99.5% non-drug users the test finds almost 1% false positives, which is large compared to the 0.5% actual drug users. Sure, the 0.5% drug users get reliably detected, but the true positives get swamped by the false positives…  Based on the above example, what do you think about a DNA test whose error rate is one in a million when used in a murder case? What would happen if the DNA test were administered to all adults in this country? Independent events (Book: p.113)  Definition: Two events A and B are said to be independent if P(A and B) = P(A) · P(B) Note: This is a definition! Some pairs of events will be independent, others won’t! Examples: 1) A = (head now) and B = (head next) are usually considered independent events, unless someone cheats. 2) Whether a rise in the S&P on any day is independent from a rise the previous day cannot be known without doing data analysis. Most likely one will find some deviation from independence, although it might be small.  Independence in terms of conditional probability: If P(B) ≠ 0, then A and B are independent if and only if P(A|B) = P(A) This makes intuitive sense: Knowing that B occurred does not give us any information about the frequency of A occurring.  What would this mean for the relative frequencies of survival and 1st class? It means that P(survival | 1st class ) = P(survival). In other words, it doesn’t matter whether one was in 1st class or not; the survival probability is the same. (This is only a thought experiment, not the truth!)  What does it mean for two flips of a coin? P(head now | head before) = P(head) That is, any coin flip is independent of the preceding coin flip. We usually assume that coin flips are “independent”, that is, no one is rigging the flip such that a head is more or less likely next time. (The idea that observing many consecutive heads makes a tail more likely next time is a typical form of magical thinking and applied by many people when playing the lottery: “I’m getting close to winning the lottery any time now…”)  Source of confusion: “Independent” is not the same as “disjoint”. In fact, two disjoint events A and B with positive probability are never independent: P(A and B) = P(Ø) = 0 ≠ P(A) · P(B) > 0 Intuitively, knowing that B occurred gives a lot of knowledge about the frequency of A occuring: zero! Example: A = (spades), B = (heart).  The notion of independence shows the advantage of using idealized probabilities as opposed to empirical relative frequencies. In any finite series of pairs of coin flips, the product condition will rarely be satisfied. Yet in the limit, for infinitely many pairs of coin flips, the product formula should hold.  The notion of more than two independent events is a little messy: We say A1, A2, …, An are independent, if product formulas for all subsets of all sizes. For three events we ask that all of the following hold: P(A1 and A2) = P(A1) · P(A2) P(A1 and A3) = P(A1) · P(A3) P(A2 and A3) = P(A2) · P(A3) P(A1 and A2 and A3) = P(A1) · P(A2) · P(A3) For n events, we’d have close to 2n conditions (2n – n – 1, to be exact), the last one being: P(A1 and A2 and … An ) = P(A1) · P(A2) · … · P(An) Example: The probability of 10 heads in 10 coin flips is 0.510 = 1/1024 ≈ 0.001. Random Variables (Book Chap. 5)  A random variable is a random outcome that is a number. Random variables are written as X, Y, Z,… (italicized letters from the end of the alphabet; for random events we use roman letters from the beginning.)  Compare: o A random event is a random binary outcome , “yes” or “no”. o A random variable is a random number outcome.  Examples: o The number heads in 10 flips of a coin o The number of Aces in a hand after dealing cards from a wellshuffled deck o The number of days with an S&P loss in a year o The trading volume (number of shares traded) on the NY Stock Exchange in a given day o A daily return on Google stock [return = (price today – price yesterday)/(price yesterday) rt = (pt – pt-1) / pt-1 , often expressed as a percent: rt · 100%.] o The number of takers of your company’s service offering after a telemarketing campaign o Income of any household in a random sample of households o Average household income of a random sample of households o GPA of any student in a random sample of students o Average of GPAs across students in a random sample o Daily temperature changes in a given location  Characteristics: Like random events, random variables incorporate the ideas of unpredictable outcomes that are independently repeatable, at least in principle.  Connection between random variables and random events: o Random variables can be used to define random events such as  A = (number of heads in 10 coin flips ≤ 5)  A = (today’s Google return ≤ – 0.03)  A = (today’s daily temperature change ≤ – 5F° ) o So-called dummy variables can be formed by assigning 1 and 0 to “yes” and “no” of a random event: X = 1 if A = “yes” 0 if A = “no”  Types of random variables: o Discrete random variables: There are only finitely many possible values that X can produce as outcomes, or else the outcomes are integers. o Continuous random variables: The possible outcomes can be anywhere on the real number line or in an interval thereof. Exercise: Classify the above examples according to discrete and continuous. Expected Values of Random Variables (Book p.140ff)  Motivation: If a random variable X is drawn N times, that is, if we observe X1, X2, X3, X4,…, XN, then the mean is 1 ·( X1 +X2 + X3 + … + XN ) . N We are now going to show that this mean has a limit as N → ∞. To this end, consider a simplified way of calculating the mean, assuming N is large and also assuming X takes on only few possible values x1,…,xC. For concreteness, keep in mind the example of N rolls of a die, so that the possible values are 1,2,3,4,5,6. Now we can bundle the observed outcomes X1, X2, X3, X4,…, XN by simply counting the frequencies of the possible values: n1 = #{Xi = x1 , i = 1,…,N } n2 = #{Xi = x2 , i = 1,…,N } … nC = #{Xi = xC , i = 1,…,N } For the rolls of a die, this would be the six frequencies of the values 1,2,3,4,5,6. Note that n1 + n2 +…+ nC = N. We can rewrite the sum as follows: X1 +X2 + X3 + … + XN = x1 · n1 + x2 · n2 + … + xC · nC The relative frequencies are fi = ni / N and note that 1 ·( X1 +X2 + X3 +… + XN ) = x1 · f1 + x2 · f2 + … + xC · fC N If you don’t grasp this formula, look at “Sim Dice and Coin Flips.JMP”, do Tables > Sort > (select ‘Throws of Dice’) > OK. Then think about how to simply calculate the mean: multiply with 1, 2,…, 6 with how many times you see 1s, 2s,…,6s, then divide by N=20. Well, ‘how many times’/20 are just the relative frequencies f1, f2,…, f6. For example, last time I checked, the numbers were ( 1· 3 + 2· 0 + 3· 7 + 4· 3 + 5· 3 + 6· 4 ) / 20 = 3.75, close to 3.5 In the limit for N → ∞ the relative frequencies will go to the respective probabilities: fi → pi (= 1/6 for rolling a fair die) Hence, again in the limit, we have: 1 · ( X1 +X2 + X3 + X4 +… + XN ) N → x1 · P(X=x1) + x2 · P(X=x2) + … + xC · P(X=xC) This is called the law of large numbers. It is really just a glorification of the convergence of relative frequencies to probabilities: fi → pi. It also motivates the following…  Definition of the expected value of a discrete random variable X: E(X) = x1 · P(X=x1) + x2 · P(X=x2) + x3 · P(X=x3) + … where x1, x2, x3,… are the possible outcomes of X. This can in principle be an infinite sum if C=∞. Note that each Ai = (X = xi) is a random event and hence is assumed to have a probability. (Expected values of continuous random variables will be discussed later.) Alternate notation: the Greek letter μ if it is understood which random variable is meant; if not, one may write μX instead. E(X) = μ = μX  Examples: o X = 1 for (head) and 0 for (tail) in fair coin flipping: E(X) = 1 · ½ + 0 · ½ = 0.5 o X = 1,…, 6 when throwing a fair die: E(X) = 1 · 1/6 + 2 · 1/6 + … + 6 · 1/6 = 3.5  Interpretation: o The expected value of X is the probability-weighted average of its outcomes. (That’s just stating the formula in words.) o The expected value of X is the long-run average of outcomes of X when X is drawn repeatedly and independently. In a simplifying way, we can think of the expected value as “the mean if we had an infinite amount of data”. Therefore the expected value is to the mean what the probability is to the relative frequency.  Special case: If we take a dummy random variable X of a random event A (X = 1 when A is observed, 0 else), then: E(X) = P(A) .  Experiment: Look at “Sim Dice and Coin Flips.JMP” again, which has a formula for simulating the throw of 20 dice, or throwing a die 20 times. Do Distribution on this column to see what its mean is. It will be close but not very close to 3.5. Then increase the number of rows (Rows > Add Rows…), for example, to 100,000. Now the mean will be close to 3.5. Note also how the bars of the histogram become much more equal, illustrating the convergence of the relative frequencies of 1, 2, …, 6 to the probabilities 1/6. ― By the way, the same can be done for the flips of a coin where X=1 for head and 0 for tail.  Source of confusion: The term “expected value” is unfortunate. It suggests that this is a value one expects to observe. But it is often the case that the expected value is not among the possible outcome values of the random variable. Example: X = outcome of a fair die. The expected value is E(X) = 3.5, but the possible values are only the integers from 1 to 6. Same for X = 1/0 for head/tail in a coin flip. Properties of the Expected Value Operation  Sums and multiples of random variables: In what follows, we need to have a sense in which we can add two random variables and multiply them with numbers. Let’s explain by example: o Imagine a game in which we flip a coin and roll a die. We say the total outcome is what the die shows, plus 1 if the coin shows head. This can obviously be described as forming a new random variable Z by adding up the outcomes of two other random variables X (roll of the die) and Y (coin flip, head=1). o A real-life example is the addition of stock holdings when forming investment portfolios. One thinks of the prices as random variables (actually: “random processes” because they keep changing over time). Multiplication of a random variable with a constant occurs when we convert currency. For example, X = Google stock price in $, might need to be converted to €, hence the converted random variable might be something like Y = 0.75 · X.  Linearity property of expected values under addition and multiplication: E( X + Y ) = E(X) + E(Y) E( c X ) = c E(X) Other simple properties follow from it: E( X + a ) = E(X) + a E( a X + b Y ) = a E(X) + b E(Y) This last property alone is so powerful that it is equivalent to the first two properties. The math proof is just a grinding piece of algebra, starting with the definition applied to E(X+Y) and E(cX), respectively. The intuitions are simple, though: Forming the mean of a portfolio of two stocks should be possible by adding the means of the two stocks. In addition, converting a stock price series from $ to € and taking the mean should be the same as converting the mean from $ to € . Population Variance and Standard Deviation (Book p.142ff)  If the mean has a counterpart for N → ∞, then so will the standard deviation, and the variance (= s2). Definition: Abbreviating μ = E(X), one defines V(X) = (x1–μ)2 · P(X=x1) + (x2– μ)2 · P(X=x2) + … or, equivalently: V(X) = E( (X – μ)2 ) Either way, the variance is “the expected value of the squared deviation from the expected value of X.” This is of similar to the variance for finite data, which was “the average squared deviation from the mean.” Alternate notation: Greek letter σ2 = V(X) Convention for any type of parameter (location, dispersion,…): o For the “population” (N → ∞) one uses Greek letters such as σ2. o For the “sample” one uses Roman letters s2.  Limit: s2 → σ2 as N → ∞ , just as x → µ . This is another application of the “law of large numbers”.  Properties of variances: V(cX) = c2 V(X) and V(X+c) = V(X).  The population standard deviation is defined as σ = V(X) ½ . This is a measure of dispersion for the “population” (= infinite amount of data): σ(cX) = | c| σ(X) and σ(X+c) = σ(X). Recall: If one stretches an object by a factor 2 (=c), the width growths by a factor 2, and if one moves an object, its width stays the same.  Examples: As usual, stylized coin tossing and rolling of dice are most accessible. o Biased coin: head → X=1, tail → X = 0, P(X=1) = p E(X) = 1· p + 0· (1–p) = p V(X) = (1– p)2 · p + (0 – p)2 · (1–p) = p(1–p) σ(X) = (p(1–p)) ½ For p = ½ we get V(X) = ¼ and σ(X) = ½ . o Fair die: E(X) = 3.5 V(X) = (1–3.5)2 /6 + (2–3.5)2 /6 + (3–3.5)2 /6 + (4–3.5)2 /6 + (5–3.5)2 /6 + (6 –3.5)2 /6 = 2.916667 σ(X) = 1.707825 (Nothing pretty about these numbers.)  JMP Experiments: These dry calculations can again be illustrated with “Monte Carlo” or simulation experiments. The larger N, the closer the estimates are to the theoretical population values. Again, go to “Sim Dice and Coin Flips.JMP”, add for example 999,980 rows to the existing 20 for a full million, then do Distribution on both variables, and you’ll get not only means but standard deviations for fair coin flips and throws of dice. You can edit the formula for the coin flip column and replace ‘0.5’ with another value between 0 and 1 to see how an unfair/biased coin flips.  Limit: s → σ as N → ∞ , which is due to the continuity of the square root operation since we knew this to be true for the variance.  Importance of variance: We mentioned in Module 2 that the sample variance is of importance in finance as a measure of volatility of investments. Actually, the population variance is of even greater importance because it lends itself to wonderful algebra for optimizing portfolios (investment mixes) by maximizing return subject to a bound on the volatility/variance ― the famous Markowitz Theory. If we denote by R = w1 R1 + w2 R2 + w3 R3, the hypothetical returns of a stock portfolio that is a mix of three stocks with returns R1, R2, R3 with proportions w1, w2, w3, the theory would try to maximize E(R) subject to V(R) ≤ c by adjusting the proportions suitably. Population Covariance and Correlation  Motivation: We have seen that many statistically relevant quantities computed from sampled data have limits if the sample size N goes to infinity: relative frequency → probability sample mean → expected value (population mean) sample variance → population variance sample standard deviation → population standard deviation The same holds for covariances and correlations computed from samples: in the limit, they approach the population covariances and population correlations. These will measure the strength of linear association for an infinity of data.  Bivariate probability: For population covariance to make sense, one has to have a notion of joint probability P(X=xi, Y=yj). The reason is that we will need to look at products such as X·Y, for example, where the possible values are xi · yj. Their probabilities are obviously P(X=xi, Y=yj). An expected value E(X·Y), for example, is then the sum over all values xi · yj multiplied with their probabilities.  Population covariance: For two random variables X and Y that can be observed together, abbreviate µX = E(X) and µY = E(Y) and define C(X,Y) = E((X–μX) · (Y–μY)) . If we write this as a probability-weighted sum, it is a little messy because we have to sum over all possible combinations of values taken on by X and Y: C(X,Y) = (x1–μX) · (y1–μY) P(X=x1, Y=y1) + (x2–μX) · (y1–μY) P(X=x2, Y=y1) + (x1–μX) · (y2–μY) P(X=x1, Y=y2) + (x2–μX) · (y2–μY) P(X=x2, Y=y2) + … Important is that the covariance is linear in both of its arguments: C(X+Y,Z) = C(X,Z) + C(Y,Z) C(aX,Y) = a C(X,Y) C(X,Y+Z) = C(X,Y) + C(X,Z) C(X,bY) = bC(X,Y) These formulas are at the root of some simple but miraculous calculations that will show us at what rate relative frequencies and means approach their limits, the probabilities and expected values, as N → ∞. In other words, these calculations will show how the precision of relative frequencies and means increases as the sample size increases. For now note the following: o C(X,X) = V(X) o sample covariance → population covariance as N → ∞  Population Correlation: C(X,Y) c(X,Y) = ――――― σ(X) σ(Y) Everything we know about sample correlation carries over to the population correlation: –1 ≤ c(X,Y) ≤ +1 c(X,Y) = +1 c(X,Y) = –1 if Y = aX + b with a > 0 if Y = aX + b with a < 0 Again: sample correlation → population correlation as N → ∞ Continuous versus Discrete Probability  Sofar we have worked with probability on finite sample spaces Ω, but as the last example of portfolio applications show, numeric outcomes such as investment returns do not usually form a finite or otherwise discrete set of numbers. Instead, it is more natural to think of them as outcomes that can theoretically be any real number. Real numbers: any number with arbitrarily many decimals, possibly infinitely many decimals. Why infinitely many decimals? So we won’t have problems exponentiating, taking logarithms, powers, roots,….  This thought makes it desirable to introduce continuous random variables. A problem we encounter quickly is that any continuous numeric outcome should be thought of having zero probability, assuming we could observe it to infinite precision: P(X = 0.0378736…) = 0, for example. Only non-trivial intervals such as (0.0378… ≤ X ≤ 0.101009…) would have non-zero probabilities. The idea is then to describe such probabilities with integrals or areas under so-called density functions f(x):  P(a ≤ X ≤ b) = Required properties: f(x) ≥ 0 and b a    f ( x ) dx f ( x ) dx = 1 Remember: The integral is the area under the curve from the lower to the upper limit. Intuition: Think of density functions as idealized histograms obtained when N → ∞ and the bin width goes to zero.  Examples of density functions: o The density function of a ‘uniformly distributed’ random variable in the interval (0,1): f(x) = 1 for 0 ≤ x ≤ 1 and f(x) = 0 otherwise. For example, for a random variable that is uniformly distributed in the unit interval (c=0, d=1), the density function is f(x) = 1 for 0 ≤ x ≤ 1 and f(x) = 0 otherwise. o The density function a random variable with a ‘pyramid distribution’ is f(x) = min(1+x,1–x) for –1 ≤ x ≤ 1 and f(x) = 0 otherwise. o The density function an ‘exponentially distributed’ random variable is f(x) = e x for 0 ≤ x and f(x) = 0 otherwise. o The density function of a (standard) normally distributed random variable is x2 f(x) = 1 2 e 2 This is the famous “bellcurve”. There is some deep math behind this function, and we will mention some later (“central limit theorem”). The distracting factor 1/(2π)1/2 is there to make the integral equal to 1 over the whole real number line. All four density functions are illustrated in the figure below. In all four cases the total area below the curve is 1, as it should be. JMP simulation of a uniformly distributed random variable: Go to the JMP table Sim Uniform and Normal.JMP and look at the Distribution of the two columns. You can add more rows and see the law of large numbers at work again: The histograms will approach the shapes of the above curves. For 10,000 cases one sees a reasonably good likeness: .00 .20 .40 .60 .80 1.00 -4.0 -2.0 .0 1.02.03.0 (One can tell JMP to plot the histograms horizontally with (red diamond) > Histogram Options > Vertical to remove the check mark.) Normally Distributed Random Variables (Book: Sec. 6.3)  In nature and in the human world, many types of measurements and quantitative observations are approximately normally distributed, meaning their histograms look like approximations to the bellcurve, if one allows it to be shifted and stretched. In order to shift and stretch it, we first have to find out what its location and dispersion parameters are because “shifting” means changing the location and “stretching” means changing the dispersion. Miraculously, these are the expected value (idealized mean) and population standard deviation: E(X) = 0 and σ (X) = 1 . The first is rather obvious because the bellcurve is symmetric about zero, but the second requires an integration by parts, which we will gladly skip. Because of the zero mean and unit SD property, this is called the “standard normal distribution”, as opposed to the following:  We now need normally distributed random variables with arbitrary expected values and population standard deviations. Here is a trick for stretching and shifting a random variable with zero population mean (=expected value) and unit population standard deviation to another random variable that has population mean µ and population SD σ: Y = σX+µ Following the rules of linearity and the rules for variances, we find E(Y) = µ and V(Y) = σ2 . Conversely, if we are given a random variable Y with these properties, then Z = Y   will have zero population mean and unit population SD. There is a reason for calling this new random variable ‘Z’: It is what we may call the “z-score random variable” because it does for N → ∞ what forming z-scores does for columns of datasets. (Book p.147.) Definition: A random variable Y is said to be normally distributed with population mean µ and population SD σ if its z-score Z variable is normally distributed with population mean 0 and population SD 1. One writes shorthand Y ~ N(µ, σ2). Obviously, Z~N(0,1). The formula for the density function of Y is not very insightful, but here is: f(y) =  1 e 2  ( y   )2 2 2  The empirical rule for the normal distribution: If Y ~ N(µ, σ2), o the probability of observing a value from Y in the interval (µ–σ, µ+σ) is 0.6826895 or about 2/3. “About 2 out of three observations fall within one SD of the mean.” o the probability of observing a value from Y in the interval (µ–2σ, µ+2σ) is 0.9544997 or about 95%. “About 19 out of 20 observations fall within two SDs of the mean.” o the probability of observing a value from Y in the interval (µ–3σ, µ+3σ) is 0.9973002 or about 99.7%. “About 997 out of 1000 observations fall within three SDs of the mean.” Now we have a better concept of what “residuals in the order of RMSE” means: If the RMSE is an estimate of the residual SD and if the residuals are approximately normally distributed, we’d expect about 2/3 of the residuals within ±RMSE, about 95% within ±2RMSE, and “almost all” within ±3RMSE. Recall that residuals always have zero mean. Using the empirical rule for residuals does assume, however, that the residuals are approximately normally distributed, which they not always are. How do we ever know whether a batch of numbers is approximately normally distributed? Good question. A natural suggestion: look at the histogram. There is a better graphical method, though, and we will look into it next. The normal distribution is practically and theoretically so important that people devised a more sensitive method than the histogram.  The normal quantile plot or normal Q-Q plot: The idea is to compare the quantiles of a column of numbers with the quantiles of a theoretical normal distribution. (Book: p. 205ff) Yes, not only means and SDs but quantiles as well exist for “populations”. Again, population quantiles are the limits of sample quantiles as N → ∞. The theoretical 30% quantile of the standard normal distribution, for example, is –0.5244005… It has the property that the area under the bellcurve from –∞ to –0.5244005… is 0.30. Here are examples of normal Q-Q plots generated in JMP: .01 3 .05.10 .25 .50 .75 .90.95 .99 2 1 0 -1 -2 -3 -3 -2 -1 0 1 2 3 Nor m al Quantile Plot This is a fine plot. If the data are approximately normally distributed, the black dots remain within the “double funnel” of dashed red curves. If the dots go outside, we have evidence against the assumption of approximately normally distributed data. An example: the prices of the Honda Accord data. .01 .05.10 .25 .50 .75 .90.95 .99 20000 15000 10000 5000 0 -3 -2 -1 0 1 2 3 Nor m al Quantile Plot The skewness is an obvious violation because normally distributed data are approximately symmetrically distributed about the mean. Here is another example which fails normality due to discreteness: 100 rolls of a fair die. The six possible values cause plateaux in the normal quantile plot that reach outside the double funnel. You can create this plot by adding 80 rows to “Sim Dice and Coin Flips.JMP”. .01 6 .05 .10 .25 .50 .75 .90 .95 .99 5 4 3 2 1 -3 -2 -1 0 1 2 3 Nor m al Quantile Plot JMP: You can add the Q-Q plots by clicking the red diamond of the histogram and select “Normal Quantile Plot”, which is the third item down. Homework 5 will ask you to set the JMP preferences such that a normal Q-Q plot is provided by default whenever you do Distribution of a quantitative variable.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Module 6 - Wharton Statistics