Oct. 17

Statistic for the Day: In 1996, the percentages of 16-24 yr old high school finishers enrolled in college were
  49% for lower income families
  63% for middle income families
  78% for higher income families

Assignment: Review for Exam #2, Wednesday, Oct. 19
Chapters 10, 11, 12, 13, 16

Arby's sandwiches

      Sandwich                           Weight (g)   Calories
   1  Big Montana                           309          590
   2  Giant Roast Beef                      224          450
   3  Regular Roast Beef                    154          320
   4  Beef 'n Cheddar                       195          440
   5  Super Roast Beef                      230          440
   6  Junior Roast Beef                     125          270
   7  Chicken Breast Fillet                 233          500
   8  Chicken Bacon 'n Swiss                209          550
   9  Roast Chicken Club                    228          470
  10  Market Fresh Turkey Ranch Bacon       379          830
  11  Market Fresh Ultimate BLT             293          780
  12  Market Fresh Roast Beef Swiss         357          780
  13  Market Fresh Roast Ham Swiss          357          700
  14  Market Fresh Roast Turkey Swiss       357          720
  15  Market Fresh Chicken Salad            322          770

[Scatterplot: Arby's Sandwiches, calories (vertical axis) vs. weight (horizontal axis)]

This type of plot, with two measurements per subject, is called a scatterplot (see p. 166).

The correlation measures the strength of the linear relationship between weight and calories.
Correlation = 0.94
More on this in the next class.

The best-fitting line through the data is called the regression line. How should we describe this line?

The intercept is 18 in this case and the slope is 2.1:
  cal = 18 + (2.1)(wt)

In this class, you don't need to know how to calculate the slope and intercept (but see p. 195 if you like formulas).
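For readers who do like formulas, here is a minimal sketch of the least-squares calculation in Python, using the 15 weight and calorie pairs from the Arby's table. The function name `fit_line` is only illustrative, and the sketch assumes the weights and calories pair up in table order. The results should come out close to the intercept, slope, and correlation quoted on the slides.

```python
# Weights (g) and calories for the 15 Arby's sandwiches, in table order.
weights = [309, 224, 154, 195, 230, 125, 233, 209, 228, 379, 293, 357, 357, 357, 322]
calories = [590, 450, 320, 440, 440, 270, 500, 550, 470, 830, 780, 780, 700, 720, 770]


def fit_line(x, y):
    """Return (intercept, slope, correlation) for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    correlation = sxy / (sxx * syy) ** 0.5
    return intercept, slope, correlation


intercept, slope, r = fit_line(weights, calories)
print(f"calories = {intercept:.0f} + ({slope:.1f})(weight), correlation = {r:.2f}")
```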
calories = 18 + (2.1)(weight in grams)
(intercept = 18, slope = 2.1)

For example, if you have a 200g sandwich, on the average you expect to get about:
  18 + (2.1)(200) = 18 + 420 = 438 calories

For a 350g sandwich:
  18 + (2.1)(350) = 18 + 735 = 753 calories

For every extra gram of weight, you expect an increase of 2.1 calories in your Arby's sandwich.

Interpretation of slope: Expected increase in response for every unit increase (increase of one) in explanatory.

Facts about Correlation:
  +1 means perfect increasing linear relationship
  -1 means perfect decreasing linear relationship
  0 means no linear relationship
  A positive sign means the two variables increase together
  A negative sign means one increases while the other decreases

Strength vs. statistical significance
Even a weak relationship can be statistically significant (if it is based on a large sample).
Even a strong relationship can be statistically insignificant (if it is based on a small sample).

Regression potential pitfalls:
Sometimes we see a strong relationship in absurd examples; two seemingly unrelated variables have a high correlation. This often signals the presence of a third variable that is highly correlated with the other two (confounding). Remember that correlation does not imply causation.
Also: If you use a regression for prediction, do not extrapolate too far beyond the range of the observed data.

[Regression plot: Vocabulary vs. Shoe Size; Words known (vertical axis) vs. Shoe Size (horizontal axis)]
Y = -806 + 555 X
Correlation = .985

Outliers
Outliers are data that are not compatible with the bulk of the data. They show up in graphical displays as detached or stray points.
Sometimes they indicate errors in data input. Some experts estimate that roughly 5% of all data entered is in error.
Sometimes they are the most important data points.
Put Options (NYTimes, September 26, 2001)
Put options on stocks give buyers the right to sell stock at a specified price during a certain time. They rise in value if the underlying stock falls below the strike price. The value of puts on airline stocks soared on Sept. 17 when U.S. stock and options markets reopened after a four-day closure, as airline stocks slid as much as 40 percent.
American Airlines was at $32 prior to the attack. Suppose a terrorist buys a put option (at say $5 per share) to have the right to sell at $25. The price after the attack was $16. That put option is now more valuable.

From story on p. 442
[Regression plot: absentee vs. machine, with fitted regression curve and 95% PI]
  R wins machine (D minus R negative for machine)
  D wins absentee (D minus R positive for absentee)
  absentee = -182.575 + 0.295319 machine - 0.0000285 machine**2
  S = 294.363   R-Sq = 62.0 %   R-Sq(adj) = 57.8 %

Outliers affect regression lines and correlation (these data aren't real):
[Scatterplot: Exercise minutes vs. GPA, with two outlying points labeled A and B]
  Red line: Without A, with B
  Black line: With A and B
  Green line: Without A or B

Two categorical variables:
  Explanatory variable: Sex
  Response variable: Body Pierced or Not
Survey question: Have you pierced any other part of your body? (Except for ears)
Research Question: Is there a significant difference between women and men at PSU in terms of body pierces?

Data:
                 Response: Body Pierced?
  Explanatory:   No     Yes    All
  Women          86     52     138
  Men            77      5      82
  All           163     57     220
From STAT 100, fall 2005 (missing responses omitted)

Percentages
              Response: body pierced?
              no         yes        All
  female      62.32%     37.68%     100.00%
  male        93.90%      6.10%     100.00%
  All         74.09%     25.91%     100.00%

  62.32% = 86 / 138
  93.90% = 77 / 82

Research question: Is there a significant difference between women and men? (i.e., between 62.32% and 93.90%)

The Debate:
The research advocate claims that there is a significant difference.
The skeptic claims there is no real difference.
The data differences simply happen by chance, since we've selected a random sample.

The strategy for determining statistical significance:
First, figure out what you expect to see if there is no difference between females and males.
Second, figure out how far the data is from what is expected.
Third, decide if the distance in the second step is large.
Fourth, if large then claim there is a statistically significant difference.

Exercise: Follow the 4 steps and answer the Research Question: Is there a statistically significant difference between males and females in terms of the percent who have used marijuana?

Data from STAT 100 fall 2005
  Rows: Sex    Columns: Marijuana
              No     Yes    All
  Female      56     76     132
  Male        31     46      77
  All         87    122     209

Step 1: Find expected counts if the skeptic is correct
This step is based on the marginal totals:
              No     Yes
  Women       A      B      132
  Men         C      D       77
              87    122     209

  $A = \frac{132 \times 87}{209} = 54.95$
(Repeat for B, C, D)

Step 1 cont'd
Repeat the process for B (and then C and D):
              No       Yes
  Women       54.95    B      132
  Men         C        D       77
              87      122     209

  $B = \frac{132 \times 122}{209} = 77.05$
Or you can simply subtract: 132 - 54.95 = 77.05

Step 1 cont'd
Green: Observed counts
Red: Expected counts if skeptic is correct.
In each cell below, the first value is the observed count and the second is the expected count.
                       Marijuana?
              No              Yes             All
  Female      56 (54.95)      76 (77.05)      132 (132.00)
  Male        31 (32.05)      46 (44.95)       77 (77.00)
  Total       87             122              209

Step 2: How far are the data (observed counts) from what is expected?
  $\frac{(56 - 54.95)^2}{54.95} = .020$
  $\frac{(76 - 77.05)^2}{77.05} = .014$
  $\frac{(31 - 32.05)^2}{32.05} = .034$
  $\frac{(46 - 44.95)^2}{44.95} = .025$
Chi-Sq = 0.020 + 0.034 + 0.014 + 0.025 = 0.093

Step 3: Is the distance in step 2 large?
Chi-squared distribution with 1 degree of freedom: If chi-squared statistic is larger than 3.84, it is declared large and the research advocate wins.
Something is large when it is in the outer 5% tail of the appropriate distribution.
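Steps 1 through 3 can be sketched in a few lines of Python. This is only an illustration: the function name `chi_squared` is made up for this example, and the 3.84 cutoff is the one quoted above for 1 degree of freedom. Because the code keeps the expected counts unrounded, it gives about 0.094 rather than the 0.093 obtained from expected counts rounded to two decimals.

```python
def chi_squared(table):
    """Chi-squared statistic for a table of observed counts, given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Step 1: expected count if the skeptic is correct.
            expected = row_totals[i] * col_totals[j] / grand_total
            # Step 2: this cell's contribution to the distance.
            stat += (observed - expected) ** 2 / expected
    return stat


# Observed marijuana counts: rows are Female, Male; columns are No, Yes.
observed = [[56, 76], [31, 46]]
CUTOFF_1_DF = 3.84  # outer 5% tail of chi-squared with 1 degree of freedom

stat = chi_squared(observed)
# Step 3: compare the distance to the cutoff.
print(f"Chi-Sq = {stat:.3f}, large: {stat > CUTOFF_1_DF}")
```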
[Plot: chi-squared density with 1 degree of freedom; 95% of the area is on one side of the cutoff and 5% on the other]
Our chi-squared value: 0.093 (from Step 2)
Cutoff = 3.84

Step 4: If distance is large, claim statistically significant difference.
  Rows: Sex    Columns: marijuana
              No             Yes            All
  Female      56 (42.4%)     76 (57.6%)     132 (100.0%)
  Male        31 (40.3%)     46 (59.7%)      77 (100.0%)
Hence, the difference: 57.6% of women versus 59.7% of men is not statistically significant in this case.
(Sample size has been automatically considered!)

How many degrees of freedom here?
[Tables of Women/Men by response (Too Young, No, Yes), labeled "One df" and "Two df"; counts shown: 135, 81, 69, 35, 112, 216]
Degrees of freedom (df) always equal (Number of rows - 1) × (Number of columns - 1)

Health studies and risk
Research question: Do strong electromagnetic fields cause cancer?
50 dogs randomly split into two groups: no field, yes field
The response is whether they get lymphoma.
  Rows: mag field    Columns: cancer
              no     yes    All
  no          20      5      25
  yes         10     15      25
  All         30     20      50

Terminology and jargon:
In the mag field group, 15/25 of the dogs got cancer. Therefore, the following are all equivalent:
1. 60% of the dogs in this group got cancer.
2. The proportion of dogs in this group that got cancer is 0.6.
3. The probability that a dog in this group got cancer is 0.6.
4. The risk of cancer in this group is 0.6.
And one more: The odds of cancer in this group are 3/2.

More terminology and jargon:
1. Identify the 'bad' response category: In this example, cancer
2. Treatment risk: 15 / 25 or .60 or 60%
3. Baseline risk: 5 / 25 or .20 or 20%
4. Relative risk: Treatment risk over Baseline risk = .60 / .20 = 3
   That is, the treatment risk is three times as large as the baseline risk.
5. Increased risk: By how much does the risk increase for treatment as compared to control?
   (.60 - .20) / .20 = 2 or 200%
   That is, the risk is 200% higher in the treatment group.
6. Odds ratio: Ratio of treatment odds to baseline odds.
   (15/10) / (5/20) turns out to be 6.
   That is, the treatment odds are six times as large as the baseline odds.
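The risk vocabulary above can be checked with a short Python sketch. The function name `risk_summary` and its argument names are assumptions made for this example; the counts are the ones from the dog table (15 of 25 with cancer in the field group, 5 of 25 in the no-field group).

```python
def risk_summary(treat_bad, treat_total, base_bad, base_total):
    """Compute the risk measures for a 2x2 table from counts of the 'bad' response."""
    treatment_risk = treat_bad / treat_total
    baseline_risk = base_bad / base_total
    # Odds compare the 'bad' count to the 'not bad' count within each group.
    treatment_odds = treat_bad / (treat_total - treat_bad)
    baseline_odds = base_bad / (base_total - base_bad)
    return {
        "treatment_risk": treatment_risk,
        "baseline_risk": baseline_risk,
        "relative_risk": treatment_risk / baseline_risk,
        "increased_risk": (treatment_risk - baseline_risk) / baseline_risk,
        "odds_ratio": treatment_odds / baseline_odds,
    }


summary = risk_summary(treat_bad=15, treat_total=25, base_bad=5, base_total=25)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The relative risk and the increased risk always differ by exactly 1, which is why a relative risk of 3 corresponds to a 200% increase.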
Final note:
When the chi-squared test is statistically significant then it makes sense to compute the various risk statements.
If there is no statistical significance then the skeptic wins. There is no evidence in the data for differences in risk for the categories of the explanatory variable.

Recall marijuana example
In each cell, the first value is the observed count and the second is the expected count.
                       Marijuana?
              No              Yes             All
  Female      56 (54.95)      76 (77.05)      132 (132.00)
  Male        31 (32.05)      46 (44.95)       77 (77.00)
  Total       87             122              209
Chi-Sq = 0.020 + 0.034 + 0.014 + 0.025 = 0.093
SO THE SKEPTIC WINS.
But what if we observed a much larger sample? Say, 100 times larger?

Marijuana example, larger sample:
                       Marijuana?
              No              Yes             All
  Female      5600 (5495)     7600 (7705)     13200 (13200)
  Male        3100 (3205)     4600 (4495)      7700 (7700)
  Total       8700           12200            20900
Chi-Sq = 2.0 + 3.4 + 1.4 + 2.5 = 9.3
NOW THE RESEARCH ADVOCATE WINS.

Practical significance
In the marijuana example, 58% of women and 60% of men reported that they had tried marijuana. This size of difference, even if it is really in the population, is probably uninteresting. Yet we have seen that a large sample size can make it statistically significant.
Hence, in the interpretation of statistical significance, we should also address the issue of practical significance. In other words, we should answer the skeptic's second question: WHO CARES?

Simpson's paradox (for quantitative variables)
Example 11.4, pp. 204-205
[Scatterplot: Price vs. # of pages for 15 books]
Correlation = -.312

Simpson's paradox (for quantitative variables)
Example 11.4, pp. 204-205
[Scatterplot: Price vs. # of pages for 15 books, with each point labeled H or S]
Correlation = -.312 for all 15 books together
Correlation = .348 and Correlation = .637 within the two groups (H and S)

Simpson's paradox for categorical variables, as seen in video
Overall admitted to City U.
              Number       Percent
  Men         198 / 360    55%
  Women        88 / 200    44%

  Business (hard)                       Law (easy)
              Number      Percent                   Number       Percent
  Men         18 / 120    15%           Men         180 / 240    75%
  Women       24 / 120    20%           Women        64 / 80     80%

Women do better in each school, but more men apply to the easier law school!

Rules: For combining probabilities
0 ≤ Probability ≤ 1
1. If there are only two possible outcomes, then their probabilities must sum to 1.
2. If two events cannot happen at the same time, they are called mutually exclusive. The probability of at least one happening (one or the other) is the sum of their probabilities. [Rule 1 is a special case of this.]
3. If two events do not influence each other, they are called independent. The probability that they happen at the same time is the product of their probabilities.
4. If the occurrence of one event forces the occurrence of another event, then the probability of the second event is always at least as large as the probability of the first event.

Rule 1: If there are only two possible outcomes, then their probabilities must sum to 1.
According to Example 3, page 302:
  P(lost luggage) = 1/176 = .0057
Thus,
  P(luggage not lost) = 1 - 1/176 = 175/176 = .9943
The point of rule 1 is that P(lost) + P(not lost) = 1, so if we know P(lost), then we can find P(not lost). Sounds simple, right? It can be surprisingly powerful.

Rule 2: If two events cannot happen at the same time, they are called mutually exclusive. In this case, the probability of at least one happening is the sum of their probabilities. [Rule 1 is a special case of this.]
Example 5, page 303: Suppose P(A in stat) = .50 and P(B in stat) = .30.
Then P(A or B in stat) = .50 + .30 = .80
Note that the events 'A in stat' and 'B in stat' are mutually exclusive. Do you see why?

Rule 3: If two events do not influence each other, they are called independent. In this case, the probability that they happen at the same time is the product of their probabilities.
Example 8, page 303: Suppose you believe that P(A in stat) = .5 and P(A in history) = .6. Further, you believe that the two events are independent, so that they do not influence each other. (Is this a reasonable assumption?)
Then P(A in stat and A in history) = (.5)×(.6) = .3

Rule 4: If the occurrence of one event forces the occurrence of another event, then the probability of the second event is always at least as large as the probability of the first event.
If event A forces event B to occur, then P(A) ≤ P(B)
Special case:
  P(E and F) ≤ P(E)
  P(E and F) ≤ P(F)
(because 'E and F' forces E to occur, and likewise forces F to occur).

Two laws (only one of them valid):
Law of large numbers: Over the long haul, we expect about 50% heads (this is true).
"Law of small numbers": If we've seen a lot of tails in a row, we're more likely to see heads on the next flip (this is completely bogus).
Remember: The law of large numbers OVERWHELMS; it does not COMPENSATE.

The game of Odd Man
Consider the "odd man" game. Three people at lunch toss a coin. The odd man has to pay the bill. You are the odd man if you get a head and the other two have tails or if you get a tail and the other two have heads.
Notice that there will not always be an odd man: there is none if the flips come up HHH or TTT.
  P(no odd man) = P(HHH or TTT)
                = P(HHH) + P(TTT)        since HHH, TTT are mutually exclusive
                = (1/2)^3 + (1/2)^3      since H,H,H are independent (as are T,T,T)
                = 1/8 + 1/8 = .25
Thus, P(there is an odd man) = 1 - P(no odd man) = 1 - .25 = .75

Play until there is an odd man. What is the probability this will take exactly three tries?
  P(odd man occurs on the third try)
    = P(miss, miss, hit)          in that order! That's the only way. (See why?)
    = P(miss) P(miss) P(hit)      since each try is independent of the others
    = [P(miss)]^2 P(hit)
    = [.25]^2 (.75) = .047
This is the final answer: The probability that the odd man occurs exactly on the third try (after two unsuccessful tries).

Expectation
What if you bet $10 on a game of craps? What is your expected profit?
(Probability of winning: 244/495, or 49.3%)
You win $10 with probability .493
You lose $10 with probability .507
Expected profit: .493($10) + .507(-$10) = -$0.14

Casino winnings, 10,000 games per day
[Histogram: Casino winnings for 1000 days]
Expectation = $1400

Casino winnings, 100,000 games a day
[Histogram: Casino winnings for 1000 days]
Expectation = $14,000
Note: Now all values are positive

Your winnings, a single game
We already calculated the expectation to be a loss of 14 cents. But you can't lose 14 cents in one game; you either win 10 dollars or lose 10 dollars.
Thus, the expected value does not have to be a possible value for any individual case.
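The expectation calculation can be sketched in Python as follows. The helper name `expected_value` is an assumption made for this example. The sketch uses the exact winning probability 244/495 rather than the rounded .493, so the casino's expected daily take comes out near $1414 instead of the rounded $1400.

```python
def expected_value(outcomes):
    """Expected value of a list of (value, probability) pairs."""
    return sum(value * probability for value, probability in outcomes)


p_win = 244 / 495  # probability of winning a game of craps

# A single $10 bet from the player's point of view.
bet = [(10.0, p_win), (-10.0, 1 - p_win)]
player_expectation = expected_value(bet)

# The casino's expected winnings are the player's expected losses, summed over games.
casino_10k_games = -player_expectation * 10_000
casino_100k_games = -player_expectation * 100_000

print(f"Player, one game: {player_expectation:.2f} dollars")
print(f"Casino, 10,000 games: {casino_10k_games:.0f} dollars")
print(f"Casino, 100,000 games: {casino_100k_games:.0f} dollars")
```

The player's expectation is not one of the two possible outcomes of a single game (+10 or -10), which is the point of the last slide.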