The eternal tension in statistics

There is a tension between what you really, really want (the population) but can never get to, and what you have to make do with (the sample). You can estimate the population and make educated guesses, but the bottom line is: "you can never have the population."

An investigator usually wants to generalize about a class of individuals or things (the population). For example:
• In forecasting the results of elections: population = voters
• For the Consumer Attitudes Survey: population = all potential users of cell phones

• Parameters: Usually there are some numerical facts about the population which you want to estimate.
• Statistic: You can do that by measuring the same aspect in the sample (Descriptive Statistics).
• Depending on the accuracy of measurement and the representativeness of your sample, you can make inferences about the population (Inferential Statistics).
• One person's sample is another person's population:
– IS 271 students are a sample for the larger student population of UC Berkeley
– IS 271 students could be the population for some other study

Understanding Populations and Samples with brown M&M's

Original distribution (the distribution of the population): Yellow 20%, Brown 30%, Orange 10%, Blue 10%, Red 20%, Green 10%

[Figures: the distribution of the population compared with Sample 1, Sample 2 and Sample 3; 5 samples]

It is a remarkable fact that many histograms in real life tend to follow the normal curve. For such histograms, the mean and SD are good summary statistics: the mean pins down the center, while the SD gives the spread. For histograms which do not follow the normal curve, the mean and SD are not good summary statistics.

What happens when the histogram is not normal?
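The point about summary statistics can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two lists are synthetic (one drawn from a normal curve, one from a right-skewed lognormal curve), and `share_within_one_sd` is a hypothetical helper, not something from the slides. The sketch checks how much of each list falls within one SD of its mean, which should be roughly 68% only when the histogram follows the normal curve.

```python
import random
import statistics


def share_within_one_sd(values):
    """Return the fraction of values lying within one SD of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= sd)
    return inside / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic (made-up) data: one roughly normal list, one strongly right-skewed list.
normal_like = [rng.gauss(6.3, 2.0) for _ in range(10_000)]
skewed = [rng.lognormvariate(5.5, 1.0) for _ in range(10_000)]

for name, data in [("normal-like", normal_like), ("skewed", skewed)]:
    print(
        name,
        "mean =", round(statistics.fmean(data), 1),
        "median =", round(statistics.median(data), 1),
        "share within 1 SD =", round(share_within_one_sd(data), 2),
    )
```

For the skewed list, the mean sits well above the median and far more than 68% of the values fall within one SD of the mean, so the usual reading of "mean plus or minus SD" no longer describes the data well.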
Properties of the Normal Probability Curve
• The graph is symmetric about the mean (the part to the right is a mirror image of the part to the left)
• The total area under the curve equals 100%
• The curve is always above the horizontal axis
• It appears to stop after a certain point (the curve gets really low)

Area under the curve:
• Within 1 SD of the mean (±1 SD): about 68%
• Within 2 SD of the mean (±2 SD): about 95%
• Within 3 SD of the mean (±3 SD): about 99.7%
• You can disregard the rest of the curve

Distribution of judges' ratings for the Webby Awards (scale of 1–10)
[Histogram] Mean = 6.3, Median = 6.3, Std. Dev = 1.98, N = 1867, Skewness = -.43, Kurtosis = -.201

Distribution of word count on web pages
[Histogram] Mean = 348.3, Std. Dev = 384.83
±3 SD = 384 × 3 = 1152. Mean − 1152 is negative: read literally, the normal curve would imply that about 30% of the sample had a negative number of links.

Measures of Normality
• Visual examination
• Skewness: a measure of symmetry (positively skewed, negatively skewed, or symmetric)
• Kurtosis: does it cluster in the middle? Kurtosis is based on a distribution's tails.
– Distributions with a large tail: leptokurtic
– Distributions with a small tail: platykurtic
– Distributions with a normal tail: mesokurtic

Positively Skewed and Leptokurtic: Word Count
[Histogram] Mean = 393.2, Median = 223, Std. Dev = 725.24, Skewness = 13.62, Kurtosis = 321.84, N = 1903

Distribution of word count (N = 1897), top six removed
[Histogram] Mean = 368.0, Median = 223, Std. Dev = 474.04, Skewness = 3.49, Kurtosis = 16.40, N = 1897

The Importance of Good Sampling Techniques

The 1936 election: the Literary Digest poll
• Candidates: Democrat F. D. Roosevelt and Republican Alfred Landon
• The Literary Digest had called the winner in every election since 1916
• Its prediction: Roosevelt will get 43%
• Sample size: 2.4 million people!
The election results

Percentage vote for Roosevelt:
• The election result: 62%
• The Digest prediction: 43%
• Gallup's prediction of the Digest prediction (sample size = 3,000): 44%
• Gallup's prediction of the election result (sample size = 50,000): 56%

The Literary Digest went bankrupt soon after. George Gallup was just setting up his organization.

Why the Digest went wrong: how they picked their sample
• Selection Bias: A systematic tendency on the part of the sampling procedure to exclude one kind of person or another from the sample.
• Sample Size: When a selection procedure is biased, making the sample larger does not help; it just repeats the mistake on a larger scale.
• Non-Response Bias: Non-respondents differ from respondents: they did not respond, whereas the respondents did!
– Lower-income and upper-income people tend not to respond, so the middle class is over-represented.

For example: predicting elections
– Non-Voters: Gallup uses a few questions to predict whether people will vote at all. The election forecast is based only on those likely to vote.
– Undecided: Ask people who they are leaning towards as of today.
– Non-Response Bias: One can give more weight to people who were available but hard to get.
– Ratio Estimation: Look at the sample obtained and compare it to the population. If there are too many educated people, give them less weight.
– Interviewer Bias: Build redundancy into the questionnaire to check for consistency. Also re-interview a small sample to check for consistency.

How much is each sample going to deviate from the population? (How big is the chance error for each sample likely to be?)

Computation of Standard Error

SE = SD of sample / √n, where n is the number of observations in the sample (the sample size)

List of numbers: 9, 7, 6, 9, 11, 12
Mean = 9, Standard Deviation ≈ 2.3 (2.28 with n − 1 in the denominator), Standard Error = 2.28 / √6 ≈ .93

Standard error varies in inverse proportion to the square root of the sample size.
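The worked example with the list 9, 7, 6, 9, 11, 12 can be checked with a short sketch. It assumes the sample SD (n − 1 in the denominator) is intended, since that is the version that reproduces the stated standard error of .93; `standard_error` is a hypothetical helper name.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(sample size)."""
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator (assumed)
    return sd / math.sqrt(len(values))


numbers = [9, 7, 6, 9, 11, 12]
mean = statistics.mean(numbers)
sd = statistics.stdev(numbers)
se = standard_error(numbers)

print(mean, round(sd, 2), round(se, 2))  # prints: 9 2.28 0.93
```

Because the sample size enters only through its square root, a sample with the same SD but four times as many observations would have half the standard error.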
As the sample size grows, the standard error shrinks. If there is a lot of spread in the sample, the SD is big and it will be hard to predict how accurate the sample will be, so the standard error will be big as well.

Standard Deviation (SD) and Standard Error (SE)
• SD refers to a list of numbers: how far are most numbers from the mean?
• SE refers to the variability in samples: how variable is each sample going to be?

Understanding the computation of the standard error
• Standard error is directly related to how variable the numbers in the sample are. Therefore it is directly related to the standard deviation.
• Standard error is also related to the sample size. The larger the sample, the smaller the chance error. Therefore it is inversely related to the square root of the sample size.

Why is knowing chance error important?
• It allows us to estimate the accuracy of our estimates, and whether we are justified in using inferential statistics.
• It allows us to make inferences about the population by accounting for chance error.

Three components of a measurement

Measurement = True Value + Systematic Bias + Chance Error
– You want to get at the true value
– You want to eliminate systematic bias
– You want to estimate chance error

Estimating Sample Size

Should the sample size for Texas be larger than that for Rhode Island? Surprisingly: no.

Analogy: Suppose you took a drop of liquid for analysis. If the liquid is well mixed, it would not matter whether the drop came from a small or a large bottle, or whether the sample is 1% or .1% of the population.

The statistical rationale: The accuracy of sampling is related to the standard deviation of the sample, not to the size of the population.

Example: Election of 1992, % of voters who chose Clinton
• 46% of voters in New Mexico, SD = .50
• 37% of voters in Texas, SD = .48
Therefore the accuracy of a sample in Texas and in New Mexico will be similar.

Types of Samples
• The convenience sample: The more convenient elementary units are chosen from a population.
• The judgement sample: Units are chosen according to a judgement made by someone who is familiar with the relevant characteristics of the population.
• The random sample: Units are chosen randomly with a known probability.
• Quota sampling: Each interviewer is assigned a fixed quota of subjects fitting certain demographic characteristics. Within the quota, the selection is a judgement sample.
– Problems: quotas might not be representative, and judgement sampling is prone to selection bias.

Types of Random Sample
• Simple random sample: Every unit of the population has an equal chance of being chosen.
• Systematic random sample: One unit is chosen on a random basis; additional elementary units are taken at evenly spaced intervals until the desired number of units is obtained.
• Stratified random sample: Obtained by independently selecting a separate simple random sample from each population stratum. A population can be divided into different groups based on some characteristic or variable, like income or education.
• Cluster sample: Obtained by selecting clusters from the population on the basis of simple random sampling. The sample comprises a census of each random cluster selected. For example, a cluster may be something like a village, a school, or a state.
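The four types of random sample listed above can be sketched in a few lines. This is an illustration under assumptions rather than a prescribed procedure: the population of 100 numbered units, the two income strata, the ten "villages", and all function names are made up for the example.

```python
import random


def simple_random_sample(population, n, rng):
    """Every unit has an equal chance of being chosen."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Pick a random start, then take units at evenly spaced intervals."""
    step = len(population) // n
    start = rng.randrange(step)
    return [population[start + i * step] for i in range(n)]


def stratified_sample(strata, n_per_stratum, rng):
    """Draw a separate simple random sample from each stratum."""
    sample = []
    for units in strata.values():
        sample.extend(rng.sample(units, n_per_stratum))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Pick clusters at random, then take every unit in each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: 100 units identified by the numbers 0..99.
population = list(range(100))
strata = {"low income": population[:60], "high income": population[60:]}
clusters = {f"village {i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 5, rng)
clustered = cluster_sample(clusters, 2, rng)

print(srs, systematic, stratified, clustered, sep="\n")
```

Note that the stratified sample here takes the same number of units from each stratum regardless of stratum size; whether to sample in proportion to stratum size instead is a design choice the slides do not address.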