STAT 100 Review Section
Week 1 & 2

Winnie Wu
[email protected]
Office Hours: By appointment
Location: Online or on campus

Key Topics
• Association vs. Causation
• Design of Experiments and Survey Methods
• Randomization
• Random Sampling: types of sampling bias

Causation vs. Association
For each example, state whether you believe there is causation or some confounding/lurking variable.

a) Smokers have higher rates of lung cancer.
• Heavy coffee drinking is associated with higher rates of smoking.
• Heavy alcohol consumption is associated with higher rates of smoking.

b) Couples that live together before being married are more likely to get divorced later.
• “Couples who are more confident about their relationship are more likely to get married straight away. Hence, more stable couples are less likely to live together before marriage than less stable couples. Living together per se is not the problem. The real problem is that a deeper source of instability is correlated with cohabitation.”
• “More religious couples are less likely to get divorced and less likely to live together before marriage.”

c) Dog owners live longer.
• It turned out that walking with a dog gave seniors a boost in parasympathetic nervous system activity, which is good because the parasympathetic nervous system helps calm and rest the body.

Sample Survey
In 1987, Shere Hite authored a book entitled Women and Love: A Cultural Revolution in Progress (http://www.amazon.com/WomenLove-Cultural-RevolutionProgress/dp/0394530527), which reported some very captivating survey results on women's intimacy and love relationships. She reported the following:
• 84% of women are “not satisfied emotionally with their relationships” (p. 804)
• 70% of all women “married five or more years are having sex outside of their marriages” (p. 856)
• 95% of women “report forms of emotional and psychological harassment from men with whom they are in love relationships” (p. 810)
• 84% of women report forms of condescension from the men in their love relationships (p. 809)

Hite collected her sample by mailing surveys to 100,000 women. She mailed the questionnaires to addresses collected from the mailing lists of women's professional groups, counseling centers, church societies and senior citizen groups. She received about 4,500 surveys in response.

Obviously, this is not an example of great survey sampling. For Hite's data collection techniques, give an example of each of the following:
• Selection Bias
• Response Bias
• Non-Response Bias

KEY TOPICS
• Descriptive statistics
  Center: mean, median
  Spread: SD, range, IQR (and outlier detection)
  Percentiles
• Graphics (bar plots, histograms, boxplots, scatterplots)
• Other concepts
  Shape: symmetric, skewed, bell-shaped
  Resistance, Outliers
  Z-scores

Mean, Definition
\[ \bar{x} = \sum_{i=1}^{n} x_i / n \]
\[ \sum_{i=1}^{24} x_i = 4507, \qquad \bar{x} = 4507 / 24 = 187.8 \text{ mg/dl} \]

Mean Advantages vs. Disadvantages
Advantages
• It is representative of all the points.
• If the underlying distribution is normal, then it is the most efficient estimator of the middle of the distribution.
• Many statistical tests are based on the mean.
Disadvantages
• It is very sensitive to outliers, e.g., if one of the 24 cholesterol levels were 800 rather than 200, then the mean would increase by 600/24 = 25 units.
• It is inappropriate if the underlying distribution is far from normal, for example, a highly skewed distribution.

Median Advantages
• Always guarantees that half of the data values lie on each side of the median.
• Insensitive to outliers.

Median Disadvantages
• If the distribution really is normal, it is a less efficient estimator of the middle than the mean, because it depends mostly on the values near the middle of the distribution.
• Most statistical procedures are based on the mean.
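The outlier sensitivity of the mean, and the resistance of the median, can be sketched numerically. The first part below uses only the figures given on the Mean slides (24 cholesterol readings summing to 4507 mg/dl). The five-value `sample` list is a hypothetical illustration, not course data, and Python's standard `statistics` module is an assumption of convenience rather than the course's software.

```python
from statistics import mean, median

# Figures from the slides: 24 cholesterol readings with a total of 4507 mg/dl.
n = 24
total = 4507
x_bar = total / n  # about 187.8 mg/dl

# Replace one reading of 200 with 800: the sum rises by 600, the mean by 600/24.
shifted_mean = (total - 200 + 800) / n
mean_shift = shifted_mean - x_bar

# Hypothetical five-value sample (NOT from the slides) to compare mean and median.
sample = [170, 185, 190, 200, 210]
with_outlier = [170, 185, 190, 800, 210]  # 200 replaced by 800

print(f"original mean: {x_bar:.1f}, shift after outlier: {mean_shift:.1f}")
print("means:  ", mean(sample), mean(with_outlier))
print("medians:", median(sample), median(with_outlier))
```

In the hypothetical sample the mean jumps from 191 to 311 while the median stays at 190, which is the sense in which the median is resistant.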
Skewness
[Figure: box plot of cholesterol_before and cholesterol_after]

SAMPLE QUESTION #1
Suppose we measure the amount of weight 5 Harvard Football players can bench-press, and we record the following observations (in pounds): 280, 250, 355, 275, 290.

a) What are the mean, median, and standard deviation for these observations?
Mean = 290
Median = 280
Standard Deviation = 39.21

b) What would happen to the mean, median and SD if another player (let's say the kicker) joined the study and lifted only 220 lbs?
The median would be (275 + 280)/2 = 277.5.
The mean would be 278.33 and the standard deviation would be 45.24.

SAMPLE QUESTION #2
Female Heights in US ~ N(μ = 63.8in, σ = 2.5in) [http://en.wikipedia.org/wiki/Human_height]
Male Heights in US ~ N(μ = 69.2in, σ = 2.8in) [http://hypertextbook.com/facts/2007/SimasCeckauskas.shtml]

a) What is the probability that your male TF is only 68in tall or shorter?
Using the z-score equation from class and the given information for male height in the US, we get:
z-score = (68 - 69.2)/2.8 = -1.2/2.8 = -0.4286
From the z table, z ≈ -0.43 corresponds to a 33.36% chance that the male TF is 68in or shorter.

b) How tall does your male TF have to be in order to be taller than 90% of the US population?
Here we are working backwards, again using the male height distribution. From the table, we find that 90% corresponds to a z-score of 1.28. Therefore,
z-score = (x - 69.2)/2.8 = 1.28
We can solve for x by rearranging the above equation:
x = (1.28 * 2.8) + 69.2 = 72.78 in
The male TF needs to be 72.78 inches tall in order to be taller than 90% of the US population.

c) Shaquille O'Neal is 85 inches tall. What proportion of the US population is as tall as Shaq (or taller)?
Using the z-score equation from class and the given information for male height in the US, we get:
z-score = (85 - 69.2)/2.8 = 15.8/2.8 = 5.643
From the z table, a z-score of 5.643 corresponds to a proportion of less than 0.01%: almost no one in the US population is as tall as Shaq or taller.

Help in R
• To find help on a particular topic, you can type, for example:
  help.search("box plot")
• If you already know the name of the command, you can type:
  ?boxplot
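The numerical answers to Sample Questions #1 and #2 can be double-checked in software. The sketch below uses Python's standard `statistics` module as an assumption of convenience (the course itself points to R, where `mean`, `median`, `sd`, `pnorm` and `qnorm` play the same roles). The variable names are illustrative only, and Question #2 is computed from the male height distribution, as in the worked solutions.

```python
from statistics import NormalDist, mean, median, stdev

# Sample Question #1: bench-press weights (pounds) from the slide.
lifts = [280, 250, 355, 275, 290]
q1a = (mean(lifts), median(lifts), round(stdev(lifts), 2))

# Part b: the kicker joins the study and lifts 220 lbs.
lifts_b = lifts + [220]
q1b = (round(mean(lifts_b), 2), median(lifts_b), round(stdev(lifts_b), 2))

# Sample Question #2: US male heights ~ N(mu = 69.2 in, sigma = 2.8 in).
male = NormalDist(mu=69.2, sigma=2.8)
p_short = male.cdf(68)          # a) P(height <= 68 in)
height_90 = male.inv_cdf(0.90)  # b) 90th percentile of male heights
p_shaq = 1 - male.cdf(85)       # c) P(height >= 85 in)

print("Q1 a) mean, median, sd:", q1a)
print("Q1 b) mean, median, sd:", q1b)
print(f"Q2 a) {p_short:.4f}  b) {height_90:.2f} in  c) {p_shaq:.2e}")
```

The exact probability for part a) comes out near 0.334, slightly above the table value of 33.36%, because the table lookup rounds z to two decimals. Likewise, the exact 90th percentile is marginally higher than 72.78 in because the table value z = 1.28 is rounded.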