Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Chapter 10: Relationships between (b/w) two measurement or quantitative variables These variables that involve measuring or counting. E.g. age, weight, height, quiz averages related to exam score Deterministic Relationship: we have a perfect correspondence b/w the two variables. E.g. Centimeters to inches Statistical Relationship: Four students have 92.2% average on quizzes. Exam scores were 73.3%, 86.7%, 86.7%, and 96.7%. A statistically significant relationship is a relationship observed in a sample that would have been unlikely to occur if really there is no relationship in the larger population. If we can show a linear relationship exists we can employ regression methods to construct a line. 1. Plot the two variables: make a scatter plot of exam scores and the quiz averages. 2. To measure the strength and direction of a linear relationship between two measurement variables we use correlation. The symbol used to represent the sample correlation is r (lower case r). Positive Negative Which of the following correlations indicate the strongest linear relationship? A. -.9 (Stronger because it is closer to 1 or -1) Negative slope B. 1.15 (beyond possible range) C. .85 (second strongest because it is close to 1 or -1) Positive slope D. 0 (no correlation) Our quiz average and exam 1 the correlation was r=0.65 Next we run a Statistical test of the significance on the correlation. We want to look for statistical significance. Example: You are a suspect in a criminal investigation. To begin with you are considered innocent. Assuming this is true, do you believe it likely or unlikely to gather evidence against that assumption? Probability Value (p-value): the probability the data produces a result assuming the null hypothesis is true. The p-value is the probability that the sample data produces a result given that the true or population correlation is zero. Therefore the smaller the p-value the stronger the evidence against the true correlation being zero. 1. Null Hypothesis, Ho 2. Legal example: Ho: Innocent 3. Class example: Ho: correlation is 0 4. Small p-values are signs of evidence against the Ho. We base the idea of small by using a baseline value called Level of Significance. 5. The most common level of significance is 5% or 0.05. 6. If the p-value is less than 0.05 we decide to reject the null hypothesis and conclude the opposite. 7. Correlation between quiz average and exam scores r= .650, P-value was .0004. 8. (P-value < .05) We reject null hypothesis and conclude the population correlation is not 0. 9. This would indicate then a linear relationship between quiz averages and exam score. 10. Algebra: Y=mx+b, Y=mx(slope)+b -> y-intercept 11. Statistics: y = b0 + b1 x, y = b0 + b1 x=> b0(y-intercept)+b1(slope)x 12. Exam: 33.1+0.60x 13. (Response variable) Exam= 33.1+0.60 quiz Avg x (explanatory variable) 14. Slope: for a unit increase in x, Y will increase/decrease its units by the value of the slope. What Exam score would we predict for an exam score of 70? Exam = 33.1 + 0.60 * 70 =33.1 + 42 = 75.1 - Remember slope and correlation are directly relate. o I.e. if slope is positive and so is correlation and vice versa. - Correlation has a range: -1 r 1, where sign only reflects direction - P-value: 0 p-value 1 o The smaller the p-value the stronger the evidence against null hypothesis. - Just because two variables are related does not mean one variable causes the other to occur. - Be aware of extrapolation (extending results beyond range of data used in analysis.) o E.S. Study don on age and weight for babies 0 to 22 months. o Regression equation: Weight= 6 + 0.9 Age in months o My age: weight= 6 + 0.9(552)= 503 pound - When we reject a known hypothesis we have statistically significantly evidence. This does not necessarily practical evidence. E.S. calf rises in jumping. - Outliers can raise issues with regression. o There are two types of outliers: General- Point A Influential- Point B Has big impact on regression, - When I provided correlation values did I ever use units? o Correlation is unit-free so if we were to change the units on one variable (e.g. change quiz average from % to decimal) the correlation would be unchanged. Ch 12 and 13: Comparing 2 Categorical Variables What is you gender? M/F Do you believe female’s sports should receive equal finances as men’s sports? To “organize” responses data we create a table where we have “r” number of rows fore one variable and “c” number of columns for the other variable where r and c are the number of levels for each variable. (“R x C” rows by columns) For rows we put explanatory variables; response on columns No 72 181 Females Males Yes 466 158 Instead of just looking at counts, we look at percents. Females Males Total No 72 181 253 Yes 466 158 624 Total 538 339 877 % No 13.38 53.39 100 What is the probability a student selected from survey is Yes? 624/877= .7115 % Yes 86.62 46.61 100 What proportion said yes? 624 What percentage said yes? 624/877 *100= 71.15% What is the risk that a student said yes? Risk: number in-group of interest (or the number with trait) divided by total number. 624/877=.7115 What are the odds? Odds: # in group of interest/# in group not of interest or # with trait / # without trait. 624/253= 2.5 to 1 (simplified) The odds of a student saying yes are 2.5 to 1. Horseracing: Betting 2 to 1, trait of interest is loosing, out of 3 races you will loose two times. What is the probability, risk, and proportion that a male student surveyed says yes? # with trait (158) / total in group (339)= .4661, Percentage= 46.61%, Odds= # with trait / # without trait= 158/181 (male yes/ male no) = 7/8, for every 15 males 7 will say yes, 8 will say no. Relative Risk (RR) (“more likely”) What is RR that females say yes to males say yes? Risk female yes/ Risk male risk=? Female risk= 466/538 Male risk= 158/339 (denominator risk known as baseline risk) RR= 1.85 (meaning females are 1.85 time more likely to say yes than males.) Increased Risk IR= (change in risk/baseline risk)*100 OR (relative risk-1.0)*100 Odds Ratio (OR) What is OR that females say yes to male say yes? Female odds= 466/72 Male odds= 158/181 OR= 7.4 (meaning the odds females say yes is 7.4 times the odds males say yes.) The importance of knowing baseline risk: Hypothetical: A study reports that females who binge drink are 10 times more likely to develop liver disease than females who do not binge drink. Without knowing baseline risk for females who do not binge drink in getting liver disease we can’t make an educated interpretation of this relative risk of 10 times. What if baseline risk was 1/10,000? That make the risk of females that do binge drink getting liver disease 10 * 1/10,000 or 1/1000. Ch 13 #1 Female Male No 269/50% 170/ 50% Yes 269/ 50% 169/ 50% Total 538 339 #2 Female Male No 160/ 30% 102/ 30% Yes 371/ 70% 237/ 70% Total 538 339 #3 Female Male No 72/ 13% 181/ 53% Yes 466/ 86% 158/ 46% Total 538 339 We conduct a Chi-square test of Independence Ho: There is no relationship in the population between the two variables (i.e. Independent) Ho: There is a relationship in the population between the two variables (i.e. dependent) We get a p-value comparing to .05; if p-value less than .05 we reject Ho and conclude a relationship exists between the two variables. Step 1: Get data and create Observed Data table (chart#3) Step 2: using this observed data, create an expected counts table under assumption of interdependence. We find this by finding the row and column totals compared to overall total. Expected Table Female Male No 538/253=877 339/253=877 Yes 538/624=877 339/624=877 Expected Count No Yes Female 155.2 (28.85%) 382.8 (71.15%) Male 97.8 (28.85%) 241.2 (71.15) Step 3: Calculate a chi-square statistic by: Chi-square=> X2= (observed-Expected)^2 / Expected X2= (72-155.2)^2/155.2 + (466-382.8)^2/382.8 +(181-97.8)^2/97.8 +(158-241.2)^2/242.2 Total= 161.18 Step 4: We get a p-value e.g. p-value is approximately .000 Step 5: Compare p-value to .05 e.g. .000 is less than .005, so we reject Ho and conclude that relationship exists in the population between gender and attitude about equal financial support. (Can’t draw a cause because of no random assignment.) (Random selection, the population is strictly that sample.) A primary concern in tests of independence is the existence of confounding or lurking variables. Over looking such variables can lead to a phenomenon known as Simpsons Paradox, where the results reported from aggregate data differ from results when the lurking confounding variable is included. Classic example of Simpson’s Paradox In 1972 Supreme Court ruled that racial disparities in assigning punishment in capital cases be eliminated. From 1976 through 1977 a gentleman named Revolet looked at capital case data in Florida. Defendant White Black Total Death Penalty No 141 149 290 Death Penalty Yes 19 17 36 Total %YES 160 166 11.9% 10.2% Considered the lurking variable of victims race: Victim being White Defendant No Yes Total % Yes White 132 19 151 12.6% Black 52 11 63 17.5% Victim being Black Defendant No White 9 Black 97 Yes 0 6 Total 9 103 No 141 149 % 0% Chapter 16: Probability Q: What is the probability that PSU men’s basketball team wins this Tuesday at Ohio State? 10%, 25%, 37% All arrived by personal opinion. My estimate is 24.5%. My probability was based on “relative frequency” In this case PSU has won 27 out of 137 road games since joining Big 10. Which of the following (or both) are true? 1. Law of large numbers: Over a long run period the expected mean or proportion of the population would be experienced. E.g. flip a coin 1000 times; we would expect about 500 heads and tails. 2. Law of small numbers: If a pattern occurs. E.g. 6 tails in a row then we are more likely to see a head on the next flip. Not True! Probability is the chance that something is the case or that an event will occur. Range of probabilities is 0 to 1. Trial is one number of repetitions in an experiment/study. Sample Space is the set all possible outcomes. E.g. coin flip->head, tail; die roll-> 1,2,3,4,5,6 Subset of Sample Space: E.g. get a tail on a coin flip; E.g. get an even number when die rolled. Complement is a set of outcomes not in the event of interest. E.g. if event of interest is “get a tail on coin flip” a complement is “not a tail” I.E. get a head. E.g. If event is get an even number on a roll of die” I.E. we get a 2 or 4 or 6. Complement would be 1, 3, or 5. Intersection is when two or more events overlap. i.e. share common outcomes. E.g. let event E be get even number on die roll (2,4,6), Let S be event we get a 6. Union: the event of all outcomes of two or more events. E.g. Let E be even numbers on role of die, let f be get a 5, the union of events E and F would be (2,4,5,6) Conditional Probability is the probability that event occurs given that another event has already happened. E.g. Let E be event get an even number on roll of die. Say we know the outcome is an even number, what is probability we got a 6? 1/3 Independent trials one event does not influence the other event. E.g. coin flip. Mutually Exclusive of Disjoint Events: events that do not have any common outcomes. Both events cannot occur at the same time. E.g. let E be even number on roll of die, Let F be we get a 5. Simple Events: are events of only one outcome. E.g. getting a 5. Notation: If we define two events as A, and B then P(A): is probability that event A occurs P(AC ) or P(A1): the complement of A P(AorB) or P(AUB: The union of A and B A stat class has 4 teaching assistants. 3 females( Lauren, Rona, Leila) 1 male (Tom) Each Ta is on chare of one section, Let event W be the Ta is female and let event M be TA is male. 1) Which event is a simple event? Event M 2) Are the events W and M mutually exclusive? Yes Since the event do not overlap. I.E. share any common outcomes. 3) If two students, unknown to each other, randomly select a TA are their choices independent? Yes 4) What is probability that a student randomly selects Lauren? ¼ 5) You are told your section TA is female. What is probability that Lauren is your TA? 1/3 Review 30 questions 22-24 from this exam 8 from last exam From Exam 1 -Table 8.1, percentiles -Interpretation, Emperical rule, Identifiing shape -Random Assignment vs. Random Selection Study quizzes, exam, and notes From Exam 2 Survey: asked students their gender and if they have ever cheated while in a committed relationship. No Yes Female 59 30 89 Male 29 9 38 127 What is the risk of females cheating? 30/89 Odds? 30/59 From the table, is the probability a randomly selected student cheats a personal opinion or a relative frequency? Relative frequency What would probability be? 39/127 What is the odds ratio for men having cheated compared to women? (9/29)/(30/59) What is the relative risk of males to females? (9/38)/(30/89) The P-value: .262, what conclusion can be made? Failing to reject/not statistically significant, we cannot say a relationship exists between gender and cheating. Did this data come from an observational study or an experiment? Observational study because of no random assignment. What if the student’s surveyed were randomly selected? Still Observational Study. If the results were statistically significant could we say gender causes the cheating (less than .005)? No because we had no random assignment. Which statistical method would be more appropriate for an analyzing the variables weight and height? Regression or Chi-Squared? Regression for quantitative/non categorical since both variables are measurement. Chi-Squared for categorical. Which following p-values represents the weakest statistical evidence? 1.5 (larger than one so its thrown out) .9 (weakest) 0.01 0.001 (strongest) -0.05 (Negative so its thrown out) Which following values represents the strongest correlation? 1.5 (larger than one so its thrown out) .9 strongest 0.01 0.001 0.08