or How to make the numbers say whatever you want.

Acknowledgements:
• Darrell Huff’s book: How to Lie With Statistics (first published in 1954)
• Danny Oppenheimer’s psychology course

Other books that may be interesting:
• Statistical Tricks and Traps, by Ennis C. Almer

“There are lies, damned lies, and statistics” -- Disraeli

Anecdotal evidence is unreliable
• Why does the phone always ring when you’re in the shower?

Determining the difference between chance and real effects
• Practically all statistics are based on a sample of a population. So...
– How was the sample chosen?
– How big is the sample?
– What population does it claim to represent?
– What population does it actually represent?

Data = Signal + Noise
• Signal = what we’re trying to measure
• Noise = error in our measurement
• If noise is random, then as the sample size increases, noise tends to cancel, leaving only signal.

• Flip a coin 5 times
– Heads four times
– 80% heads
• Flip 100 times
– Same results???
• In general, the larger the sample size, the better the estimate.

• A telephone poll was taken during the U.S. presidential campaign of Franklin D. Roosevelt (1936).
• Based on the results of that poll, the pollsters predicted that FDR would not win. FDR did win, however.
• The poll did NOT accurately reflect all of the voters because the opinions of only one part of the population (wealthy people with telephones) were taken into account.

Center
• Mode
• Median
• Mean

Spread
• Variability
• Standard Deviation
• Range

Mode
• The mode is the value (or values) that occurs most frequently
• 10 15 20 20 22 30 30 40 40 50 50 50 60 70 70 79 80 100 100 125 200 200 300 400 450
• The mode in this case is 50

Median (M)
• The midpoint of a distribution.
• Half (50%) of the observations are larger, half (50%) are smaller.

Steps to Find Median
• Arrange all the observations in order of size, from smallest to largest.
• Find the position of the median by counting the number of observations and finding the one in the middle.
• If there is an odd number of observations, the median will be one of your observations.
• If there is an even number of observations, the median is the average of the two center observations in the ordered list and may not be one of your observations.

Example
• How many CDs do you own?
• 60 20 15 79 30 200 200 400 10 40 50 22 125 20 60 40 80 100 30 50 100 300 70 70 50 450
• 10 15 20 20 22 30 30 40 40 50 50 50 60 70 70 79 80 100 100 125 200 200 300 400 450

Mean
• Steps
– Add the values of the observations
– Divide the total by the number of observations
– That’s the mean!

Mean: arithmetic average = Σx/n
Median: the halfway point
Mode: the most common answer
Average could mean any of them...

Incomes:
• $9,000
• $9,000
• $9,000
• $12,000
• $120,000
• $85,000
• $15,000
Mean = $37,000
Median = $12,000
Mode = $9,000
Each is a legitimate average, but they can serve conflicting purposes.

Range: the overall difference between the highest and lowest scores.
Variance: the average squared difference from the mean.

Standard Deviation
• Measures the spread by looking at how far the observations are from their mean.
• Roughly, the typical distance of the observations from their mean.
• Variance (s^2) is the average of the squared distances from the mean (dividing by n - 1 for a sample).
• Standard deviation (s) is the square root of the variance.

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$

Identical Range
• Set 1: 1 9 9 11 11 11 9 9 9 11 11 19
• Set 2: 1 1 1 19 19 19 1 1 1 19 19 19

Identical Variance
• Set 1: 1 9 9 11 11 11 9 9 9 11 11 19
• Set 2: 6 6 6 13 14 14 6 6 7 14 14 14

Median vs. Mean
• The median and mean both describe the center, but which is better?
• The mean is strongly influenced by a few extreme observations; the median is not.

The Normal Curve

Central Limit Theorem
• Any time you have a measure that is created by summing many independent trials of data (Signal + Noise), its distribution will be approximately a normal curve.

Bimodal Distributions
Skewed Distributions
Uniform Distributions

Correlation measures the strength of a relationship between two variables.
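As a rough illustration of how such a measure can be computed, here is a minimal Python sketch. It assumes the coefficient meant is Pearson's product-moment correlation, which the slides do not name, and the function name `pearson_r` and all three data lists are hypothetical values invented for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumption: the slides' "R" is Pearson's coefficient.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of the coefficient).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data, made up purely for illustration.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]    # tends to go up as hours go up
falling = [9, 7, 6, 4, 1]   # tends to go down as hours go up

r_up = pearson_r(hours, rising)
r_down = pearson_r(hours, falling)
print(f"rising:  R = {r_up:.2f}")
print(f"falling: R = {r_down:.2f}")
```

The sign of the result shows the direction of the relationship and its size shows the strength. Nothing in the calculation says which variable, if either, causes the other.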
• Positive correlation: 0 < R ≤ 1
• Negative correlation: -1 ≤ R < 0
• No correlation: R = 0

Correlation does not imply causation

A person wearing red is 129 times more likely to be gored by a charging bull!
• But what is the base rate? (More likely than what, or than whom?)

• 98% of Americans have internet access available.
• The average temperature in Oklahoma City over the past 100 years is 63.4 degrees (and each year has been within 1 degree of that average).
• Consumer spending on prescription drugs has doubled since 1980.
• 85% of all car accidents occur within 10 miles of the home.
• 10 times as many people die each year in plane crashes as in train crashes.
• During the Spanish-American War, the death rate for soldiers was 9 per 1000. The death rate for civilians in New York City was 16 per 1000.

Fluffy O’s Cereal Gives a body Energy!!
[Chart: vertical axis marked 5, 10, and "??" with no units; horizontal axis marked 1 minute and 2 minutes]

[Chart: Earnings by quarter, 1st Qtr to 4th Qtr, vertical axis running from 60 to 62.5]
[Chart: Earnings by quarter, 1st Qtr to 4th Qtr, vertical axis running from 0 to 100]

Even when the scale is fine, graphs can still be deceptive
[Chart: values by quarter, 1st Qtr to 4th Qtr, vertical axis running from 0 to 40]

As you can see, Bob earns substantially more than Joe
[Pictogram: Joe’s income, Bob’s income]
• Is it really a two-to-one ratio that’s being displayed?

Portion of US income going to federal taxes
• Federal spending has become equal to the total income of the people of the states shaded black.

Percent of US income going to federal taxes

Tylenol is used by 90% of doctors for their own aches and pains!
• So are Aspirin, Ibuprofen, and Bayer!

When Dewey was elected Governor in 1942, the minimum teacher’s salary in some districts was as low as $900 per year. Upon Governor Dewey’s recommendation… the Legislature in 1947 appropriated $32,000,000 out of state surplus to provide an increase in teachers’ salaries. As a result, the minimum salaries of teachers in New York City range from $2500 to $5325.
1) Stanford is the #1 program in psychology
2) I am the only (and therefore #1) student at Stanford studying Decision Errors
Therefore: I am the top student studying Decision Errors in the country.

1) Statistics are useful for telling random noise apart from real effects
2) Numbers are not absolute, and they can be easily manipulated
3) Always scrutinize data closely, and draw your own conclusions.
4) 85% of all statistics are made up on the spot: the rest are all wrong
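One way to act on point 3 is to recompute a reported "average" yourself. The sketch below uses Python's standard `statistics` module on the seven incomes from the earlier example; the variable names are illustrative, and the spread figures use the n - 1 (sample) form of the variance given in the slides.

```python
import statistics

# The seven incomes from the example earlier in the slides (in dollars).
incomes = [9000, 9000, 9000, 12000, 120000, 85000, 15000]

mean = statistics.mean(incomes)      # sum of the values divided by their count
median = statistics.median(incomes)  # middle value once the list is sorted
mode = statistics.mode(incomes)      # most frequently occurring value

# Sample variance and standard deviation (both divide by n - 1).
variance = statistics.variance(incomes)
std_dev = statistics.stdev(incomes)

print(f"mean   = ${mean:,.0f}")
print(f"median = ${median:,.0f}")
print(f"mode   = ${mode:,.0f}")
print(f"sample standard deviation = ${std_dev:,.0f}")
```

The three averages come out to $37,000, $12,000, and $9,000, matching the slide. The two largest incomes pull the mean far above the median, so someone who wants the group to look well paid would quote the mean, and someone who wants it to look poorly paid would quote the mode.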