Sociology 5811: Lecture 6: Samples, Populations
Copyright © 2005 by Evan Schofer
Do not copy or distribute without permission

Announcements
• Problem set #2 due next Tuesday, Sept 27

Problem Set: Z-table
• Several problems require looking up the area under the normal curve associated with certain Z-scores
• Requires use of a “Z-table”
• Found in Knoke, p. 459
• Issue: We know that about 95% of the area under a normal curve falls within +/- 2 standard deviations
• Thus: Area under the normal curve from Z = -2 to Z = 2 is approximately .95
• Area left of Z = -2 and right of Z = 2 is approximately .05
• But, what if we want the area for a value like 1.4?
• The Z-table lists areas for all values!

Problem Set: Z-table
• Let’s look at Z = .40
• Area from 0 to Z: .15
• Area beyond +Z: .35
• Question: What is the area from -Z to +Z?

Review: Probability
• Probability of event A defined as p(A):
  $$p(A) = \frac{\text{number of outcomes in which } A \text{ occurs}}{\text{total number of outcomes}}$$
• “The probability of a particular outcome is the proportion of times that outcome would occur in a long run of repeated observations” (Agresti & Finlay 1997, p. 81)
• Example: p(red) = 2 divided by 10, so p(red) = .20

Probability
• Question: What is the probability of picking red twice in a row (assuming you replaced the red one after you picked)?
• Answer: Combined probabilities multiply
• Each probability is .20
• .20 x .20 = .04
• Under 5% chance!
• Conclusion: If you pick many times, you are unlikely to continually get atypical colors
• It can happen, but it is very improbable.
• Ex: Picking red 5 times in a row: Probability is .00032.

Review: Probability Distributions
• Both nominal/ordinal and continuous measures can be conceived of as probability distributions
– Nominal/Ordinal: Height of bars indicates the probability of picking someone with that value
– Continuous: Can’t be graphed in separate bars
• Instead, a continuous curve approximates probability
• Area under curve = probability of picking a case within a given range of values.
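
The Z-table lookups and the "combined probabilities multiply" rule above can be checked numerically. The following is a minimal Python sketch, not part of the original lecture: the function names (`normal_cdf`, `area_zero_to_z`, `area_beyond_z`, `area_minus_z_to_z`, `prob_repeated`) are hypothetical helpers introduced here for illustration, and the normal-curve areas are computed from the error function rather than read from Knoke's table, so results may differ from the printed table in the last decimal place.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def area_zero_to_z(z):
    """Area between the mean (Z = 0) and Z, as a Z-table lists it."""
    return normal_cdf(abs(z)) - 0.5


def area_beyond_z(z):
    """Area in one tail beyond Z."""
    return 1.0 - normal_cdf(abs(z))


def area_minus_z_to_z(z):
    """Area from -Z to +Z: twice the area from 0 to Z, by symmetry."""
    return 2.0 * area_zero_to_z(z)


def prob_repeated(p, k):
    """Probability of the same independent outcome k times in a row
    (assumes sampling with replacement, as in the red example)."""
    return p ** k


if __name__ == "__main__":
    # Hypothetical illustration values taken from the slides above.
    for z in (0.40, 1.40, 2.00):
        print(f"Z = {z:.2f}: 0 to Z = {area_zero_to_z(z):.4f}, "
              f"beyond Z = {area_beyond_z(z):.4f}, "
              f"-Z to +Z = {area_minus_z_to_z(z):.4f}")
    print("red twice in a row:", round(prob_repeated(0.20, 2), 5))
    print("red five times in a row:", round(prob_repeated(0.20, 5), 5))
```

Note that the area from 0 to Z and the area beyond Z always sum to .50, which is why the two Z-table columns for Z = .40 (.15 and .35, rounded) add up to half the curve.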

Review: Probability Distributions
• Notation:
– Greek alpha (α) is used to refer to probabilities in a range for a continuous distribution

Review: Probability Distributions
• P(Y<a) = α

Review: Probability Distributions
• P(Y<a, Y>b) = α

Review: Normal Distributions
• Normal curves have well-known properties:
• About 68% of the area under the curve (and thus cases) falls within 1 standard deviation of the mean
• About 95% of cases fall within 2 standard deviations
• About 99.7% of cases fall within 3 standard deviations
• Percentages translate directly into probabilities
• Thus, it is easy to determine the probability associated with any range around the mean
• e.g., there is a .95 probability that a randomly chosen person will fall within 2 SD of the mean
• This property makes normal curves very useful!

Samples and Populations
• Issue: As social scientists, we wish to describe and understand large sets of people (or organizations or countries)
• School achievement of American teenagers
• Fertility of individuals in Indonesia
• Behavior of organizations in the auto industry
• Problem: It is seldom possible to collect data on all relevant people (or organizations or countries) that we hope to study.

Samples and Populations
• How can we calculate the mean or standard deviation for a population without data on most individuals?
• Without even knowing the total N of the population?
• Are we stuck?
• IDEA: Maybe we can gain some understanding of large groups, even if we have information about only some of the cases within the group
• We can examine part of the group and try to make intelligent guesses about what the entire group is like.

Populations Defined
• Population: The entire set of persons, objects, or events that have at least one common characteristic of interest to a researcher (Knoke, p. 15)
• Populations (and things we’d like to study)
• Voting age Americans (their political views)
• 6th grade students attending a particular school (their performance on a math test)
• People (their response to a new AIDS drug)
• Small companies (their business strategies).

Population: Defined
• People in those populations have at least one common characteristic, even if they are different in many other ways
• Example: Voting age Americans may differ wildly, but they share the fact that they are voting age Americans
• Beyond the literal definition, a population is the general group that we wish to study and gain insight into.

Sample: Defined
• Sample: A subset of a population
• Any subset, chosen in any way
• But, the manner of choosing makes some samples more useful than others
• Datasets are usually samples of a larger population
• Beyond the literal definition, sample often means “the group that we have data on”.

Statistical Inference: Defined
• Our Goal: to describe populations
– However, we only have data on a sample (a subset) of the population
– We hope that studying a sample will give us some insight into the overall population
• Statistical Inference: making statistical generalizations about a population from evidence contained in a sample (Knoke, p. 77).

Statistical Inference
• When is statistical inference likely to work?
• 1. When a sample is large
• If a sample approaches the size of the population, it is likely to be a good reflection of that population
• 2. When a sample is representative of the entire population
• As opposed to a sample that is atypical in some way, and thus not reflective of the larger group.

Random Samples
• One way to get a representative sample is by choosing one randomly
• Definition: A sample chosen from a population such that each observation has an equal chance of being selected (Knoke, p. 77)
– Probability of selection: $p(\text{selection}) = \frac{1}{N}$
• Randomness is one strategy to avoid “bias”, the circumstance in which a sample is not representative of the larger population.

Biased Samples: Examples
• Biased samples can lead to false conclusions about characteristics of populations
• What are the problems with these samples?
– Internet survey asking people the number of CDs they own (population = all Americans)
– Telephone survey of political opinions conducted during the day (pop = voting age Americans)
– Survey of an Intro Psych class on causes of stress and anxiety (pop = all humans)
– Survey of Fortune 500 firms on reasons that firms succeed (pop = all companies).

Statistical Inference
• Statistical inference involves two tasks:
• 1. Using information from a sample to estimate properties of the population
• 2. Using laws of statistics and information from the sample to determine how close our estimate is likely to be
– We can determine whether or not we are confident in our assessment of a population

Statistical Inference Example
• Population: Students in the United States
• Sample: Individuals in this classroom
• Question: What is the mean number of CDs owned by students in the US?
• Goal #1: Use information on students in this class to guess the mean number of CDs owned by students in the US
• Goal #2: Try to determine how close (or far off) our estimate of the population mean might be. Estimate the quality of the guess.
• Goal #2 helps prevent us from drawing inappropriate conclusions from Goal #1

Population and Sample Notation
• Characteristics of populations are called parameters
• Characteristics of a sample are called statistics
• To keep things straight, mathematicians use Greek letters to refer to populations and Roman letters to refer to samples
– Mean of a sample is: Y-bar
– Mean of a population is Greek mu: μ
– Standard deviation of a sample is: s
– Standard deviation of a population is lower case Greek sigma: σ

Population and Sample Notation
• An estimate of a population parameter based on information from a sample is called a “point estimate”
– Example of a point estimate:
• Based on this sample, I estimate that the mean # of CDs owned by students in the U.S. is 47.
• Formulas used to estimate a population parameter from a sample are called “estimators”.

Estimation: Notation
• We often wish to estimate population parameters, using information from a sample we have
• We may use a variety of formulas to do this
• Mathematicians identify estimates of population parameters in formulas by placing a caret (“^”) over the parameter
– The caret is called a “hat”
– An estimate of σ is called “sigma-hat”
– Symbol: σ̂

Populations and Samples
• Population parameters (μ, σ) are constants
• There is one true value, but it is usually unknown
• Sample statistics (Y-bar, s) are variables
• Up until now we’ve treated them as constants
• But, there are many possible samples
• Different samples yield different values of the mean & S.D.
– Like any variable, the mean and S.D. have a distribution!
• Called the “sampling distribution”
• Made up of the values of the statistic across all possible samples from a given population
• We’ll discuss it later…

Population and Sample Distributions
[Figure: population and sample distributions, with the sample labeled Y-bar and s]

Population Distributions
• Population distributions are typically conceived of as probability distributions
• Because we don’t usually see the whole thing… We just pull individuals out based on relative probability
• Some populations are finite and could be graphed as a raw frequency plot or histogram (examples?)
• Many populations are infinite and can’t ever be graphed as a frequency plot/histogram (examples?)
• The main thing that matters about a population is how likely you are to pick a person with a given value (or in a range of values).

Populations and Samples: Overview
                      Population                      Sample
Characteristics       “parameters”                    “statistics”
Characteristics are:  constant (one for population)   variables (varies for each sample)
Notation              Greek (μ, σ)                    Roman (Y-bar, s)
Estimate              “hat”: σ̂                        “point estimate” based on sample

Review: Normal Distribution
• Example: Blood Cholesterol
• normally distributed
• mean = 200
• S.D. = 40
• What is the range of cholesterol that encompasses 95% of the population?
• Answer: 200 +/- (2)(40) = 200 +/- 80
– Range = 120 to 280

Normal Distributions and Inference
• The link between normal distributions and probabilities allows us to draw conclusions
• Example: Suppose you are a detective
• You suspect that a person is taking an illegal drug
• One side-effect of the drug is that it raises cholesterol to extremely high levels
• Strategy: Take a sample of blood from the person
• Compare with the known distribution for normal people
• Observation: Blood cholesterol is 5 standard deviations above the mean…

Normal Distributions and Inference
• What can you tell by knowing cholesterol is 5 standard deviations above the mean?
• About 99.7% are within 3 standard deviations, about 0.3% are not
• A much lower percentage fall 5 S.D.s from the mean
• Based on properties of a normal curve:
• Only .000000287 of cases fall 5 or more S.D.s above the mean
• Conclusion: It is improbable that the person is not taking drugs
• But, in a world of 6 billion people, we would expect about 1,722 such people (6 billion x .000000287), so you can’t be absolutely certain…
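
The detective example can be worked through numerically. The sketch below is an illustration added to the slides, not the lecture's own method: the helper names (`z_score`, `upper_tail`) and the constants are assumptions chosen to match the cholesterol example (mean = 200, S.D. = 40) and the 6 billion world population quoted above. It uses the complementary error function to get the upper-tail area of a standard normal curve.

```python
import math

# Values taken from the cholesterol example; the names are hypothetical.
MEAN = 200.0
SD = 40.0
WORLD_POPULATION = 6_000_000_000


def z_score(value, mean, sd):
    """Number of standard deviations that value lies above the mean."""
    return (value - mean) / sd


def upper_tail(z):
    """P(Z >= z) for a standard normal variable.

    erfc is used instead of 1 - cdf so that very small tail
    areas are not lost to floating-point rounding.
    """
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Range covering roughly 95% of cases: mean +/- 2 S.D.
low, high = MEAN - 2 * SD, MEAN + 2 * SD

# A reading 5 S.D. above the mean corresponds to cholesterol of 400.
reading = MEAN + 5 * SD
p_tail = upper_tail(z_score(reading, MEAN, SD))
expected_people = p_tail * WORLD_POPULATION

if __name__ == "__main__":
    print(f"95% range: {low:.0f} to {high:.0f}")
    print(f"P(5 or more S.D. above the mean): {p_tail:.3e}")
    print(f"Expected people worldwide: {expected_people:.0f}")
```

The unrounded tail area gives an expected count of roughly 1,720 people, slightly below the 1,722 obtained by multiplying 6 billion by the rounded value .000000287. Either way the point of the slide holds: the event is extremely improbable for any one person, but not impossible across a very large population.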