Sociology 5811: Lecture 6: Probability, Probability Distributions, Normal Distributions
Copyright © 2005 by Evan Schofer
Do not copy or distribute without permission

Announcements
• Problem Set #1 due today
• Problem Set #2 handed out today; due in a week
• Class schedule
• Done with univariate stats
• Starting probability today

Review: Z-Score
• The Z-score: One way to assess relative placement of cases in a distribution
– Can be used for comparisons, like quantiles
• Converts all values of variables to a new scale, with mean of zero, S.D. of 1
– Scores typically run from about -3 to +3
• Formula:
$$Z_i = \frac{d_i}{s_Y} = \frac{Y_i - \bar{Y}}{s_Y}$$

Probability Defined
• Definition: “The probability of a particular outcome is the proportion of times that outcome would occur in a long run of repeated observations” (Agresti & Finlay 1997, p. 81)
• Probability of event A defined as p(A):
$$p(A) = \frac{\text{outcomes in which A occurs}}{\text{total number of outcomes}}$$
• Example: Coin flip, probability of “heads”
– 1 outcome is “heads”, 2 total possible outcomes
– p(“heads”) = 1 / 2 = .5

Probability
• Question: What is the probability of picking a red marble out of a bowl with 2 red and 8 green?
$$p(\text{red}) = \frac{\text{outcomes in which red occurs}}{\text{total number of outcomes}}$$
• There are 2 outcomes that are red
• There are 10 total possible outcomes
• p(red) = 2 divided by 10
• p(red) = .20

Frequencies and Probability
• Note: The probability of picking a color relates to the frequency of each color in the jar
– 8 green marbles, 2 red marbles, 10 total
– p(Green) = .8, p(Red) = .2
• For nominal or ordinal variables:
$$p(x) = \frac{f(x)}{N}$$
• Where f(x) is the frequency of x in a sample

Frequency Charts and Probability
[Figure: frequency chart of HIGHEST YEAR OF SCHOOL COMPLETED, GSS data (N=2904); y-axis: Frequency]
• Note that the total N is 2904
• Note that 392 individuals have 16 years of education

Probabilities: Nominal/Ordinal
• Height of bars in a frequency chart reflects the probability of choosing cases from our dataset
• If we pulled some case randomly from our data
• What is the probability of choosing a person from the dataset with 16 years of education?
• Notation: p(Y=16)
• Computed as number of people with 16 years of education (frequency) divided by total N:
$$p(Y=16) = \frac{f(Y=16)}{N} = \frac{392}{2904} = .135$$

Probability Distributions
• In a frequency plot, the height of bars reflects frequency
• Dividing each value by N converts a chart to a “probability distribution”
• Indicating the probability of choosing an individual with a given value of Y
• Entire plots can be converted to probability distributions
• Shape of the distribution is preserved
• Height of bar represents probabilities rather than frequencies.

Probability Distribution Example
[Figure: probability distribution of HIGHEST YEAR OF SCHOOL COMPLETED; y-axis: Percent]
• As we calculated, p(Y=16) = .135

Probability: Continuous Variables
• Continuous measures can take on an infinite number of values
• So, it doesn’t make sense to think of the probability of picking any exact value
• 1. Typically, only one case has a given value
• The sample may contain a case with 16.238908 years of education: p(Y=16.238908) = 1/N
• 2. Most exact values have a frequency of 0
• Ex: 0 cases with 16.48900242 years of education
• The probability p(Y=16.48900242) is zero.

Continuous Distributions
• Continuous distributions can be approximated by connecting peaks of a histogram:
[Figure: histogram with a line connecting the peaks of the bars]
• Line approximates height of bars for all values of Y

Continuous Probability Distributions
• For continuous probability distributions:
• Probabilities are not associated with single values
• e.g., the probability that Y=16
• Instead, probabilities are associated with a range of values
• e.g., the probability that Y is between 15 and 20
• These are visually represented by the area under a distribution between 15 and 20
$$p(Y \text{ in a range}) = \frac{\text{Area under curve in range}}{\text{Total area under curve}}$$

Continuous Probability Distributions
[Figure: distribution curve with one area shaded red and the rest shaded blue]
• p(Red) = Red Area / (Red Area + Blue Area)

Probability Distributions: Notation
• Notation:
• Greek alpha (α) is used to refer to a probability for a continuous distribution
• Notation: p(15<Y<20) = α
• α = Probability of variable Y between 15 and 20
• You can also choose an open-ended range
• p(Y>.4) = α
• Or multiple ranges
• p(.2<Y<.4 or Y>8) = α
• Question: If p(Y>MdnY) = α, what is α?

Continuous Probability Distributions Examples
• p(a<Y<b) = α
• p(Y<a) = α
• p(Y<a or Y>b) = α

The “Normal” Distribution
• A particular shape of symmetrical distribution that comes up a lot
• Some biological phenomena have this distribution, such as height and cholesterol levels
• Certain statistical regularities take this form
• It is a “bell-shaped” distribution
• Note: not all bell-shaped curves are normal distributions.
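
Returning to the two definitions of probability given above, here is a minimal sketch (not part of the original slides). For a nominal or ordinal variable, a probability is a frequency divided by N; for a continuous variable, it is the share of the area under the curve that falls in a range. The function names and the triangular example curve are hypothetical, chosen only for illustration, and the area is approximated with a simple trapezoid rule.

```python
def probability_from_frequency(frequency, total_n):
    """p(x) = f(x) / N for a nominal or ordinal variable."""
    return frequency / total_n


def area_under(curve, low, high, steps=10_000):
    """Approximate the area under `curve` between low and high (trapezoid rule)."""
    width = (high - low) / steps
    total = 0.5 * (curve(low) + curve(high))
    for i in range(1, steps):
        total += curve(low + i * width)
    return total * width


def range_probability(curve, a, b, curve_min, curve_max):
    """p(a < Y < b) = area under curve in range / total area under curve."""
    return area_under(curve, a, b) / area_under(curve, curve_min, curve_max)


def triangle(y):
    """Hypothetical curve: rises from 0 at Y=0 to a peak at Y=1, back to 0 at Y=2."""
    return max(0.0, 1.0 - abs(y - 1.0))


# GSS example from the slides: 392 of 2904 cases have 16 years of education.
p_16 = probability_from_frequency(392, 2904)
print(round(p_16, 3))

# Share of the triangle's area that lies between Y=0 and Y=1.
p_half = range_probability(triangle, 0.0, 1.0, 0.0, 2.0)
print(round(p_half, 3))
```

Because the hypothetical triangle is symmetric around Y=1, half of its area lies below the peak, so the second probability should come out at about one half.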

Example of a Normal Curve
[Figure: Normal Curve, Mean = .5, SD = .7]

Normal Curves
• Normal curves are a “family” of curves
• They all share the same general curvature and formula
• But, there are infinite variations with different means, standard deviations
• They have different centers (means) and are more or less spread out.
• Examples of different normal curves:
• Mean for male height = 70 inches, S.D. = 4
• Mean for cholesterol = 182, S.D. = 38.

Formula for Normal Curves
• The shape of a normal probability distribution can be expressed as a function:
$$p(Y) = \frac{e^{-(Y-\mu_Y)^2 / 2\sigma_Y^2}}{\sqrt{2\pi\sigma_Y^2}}$$
• Where:
– e refers to a constant (2.718)
– π refers to a constant (3.142)
– μ_Y refers to the mean of the normal curve
– σ_Y refers to the standard deviation of the normal curve.

Properties of Normal Curves
• If you choose a mean and s.d., you can plot a corresponding normal curve
• Probability distributions can also be normal
• Remember: the proportion of area under the curve in a given range is equal to the probability of picking someone in that range
• Normal curves are useful because:
• The probabilities of cases falling in a certain range on a normal curve are well known
• Thus, it is easy to determine p(a<Y<b)!

Properties of Normal Curves
• Normal curves have well-known properties:
– About 68% of area under the curve (and thus cases) falls within 1 standard deviation of the mean
– About 95% of cases fall within 2 standard deviations
– Over 99% of cases (about 99.7%) fall within 3 standard deviations
• In fact, the percentage can be easily determined for any number of standard deviations (e.g., 1.5 or 2.3890)
• Note: This is only true of normal curves
• You can’t apply these rules to non-normal distributions.

Properties of Normal Curves
• The predictable link between standard deviations and percent of cases falling near the mean makes normal curves very useful
• 1. You can determine the probability associated with any range of values around the mean
• e.g., there is a .95 probability that a randomly chosen person will fall within 2 SD of the mean
• 2. You can convert Z-scores (# standard deviations) into something like a percentile
• If a case falls 3 standard deviations above the mean, it must be in the 99th percentile.

Properties of Normal Curves
• Visually:
[Figure: normal curve with distances of Z, 2Z, and 3Z from the mean marked]
• Question: Why are these referred to as Z, 2Z, 3Z?

Normal Distribution: Example
• Male height is normally distributed
– Distribution: mean = 70 inches, S.D. = 4 inches
[Figure: normal curve of male height; x-axis: 55 to 85 inches]
• Question: Is this a frequency distribution or a probability distribution?

Normal Distribution: Example
• Male height is normally distributed
• Distribution: mean = 70 inches, S.D. = 4 inches
• What is the range of heights that encompasses 99% of the population?
• Hint: that’s +/- 3 standard deviations
• Answer: 70 +/- (3)(4) = 70 +/- 12
• Range = 58" to 82"
• This is very useful information
• Ex: If you are designing a car to comfortably fit most people.

Normal Distribution: Example
• Over 99% of cases fall within 3 S.D. of the mean
[Figure: normal curve of male height; x-axis: 55 to 85 inches]
• In total, under 1% fall above 82 inches or below 58 inches

Normal Distributions and Inference
• The link between normal distributions and probabilities allows us to draw conclusions
• Example: Suppose you are a detective
• You suspect that a person is taking an illegal drug
• One side-effect of the drug is that it raises cholesterol to extremely high levels
• Strategy: Take a sample of blood from the person
• Compare with the known distribution for people not taking the drug
• Observation: Blood cholesterol is 5 standard deviations above the mean…

Normal Distributions and Inference
• What can you tell by knowing cholesterol is 5 standard deviations above the mean?
• Over 99% are within 3 standard deviations; under 1% are not
• A much lower percentage fall 5 S.D.’s from the mean
• Based on properties of a normal curve:
• Only .000000287 of cases fall 5 or more S.D.’s above the mean
• Conclusion: It is improbable that the person is not taking drugs
• But, in a world of 6 billion people, there are about 1,722 such people, so you can’t be absolutely certain…

Samples and Populations
• Issue: As social scientists, we wish to describe and understand large sets of people (or organizations or countries)
– School achievement of American teenagers
– Fertility of individuals in Indonesia
– Behavior of organizations in the auto industry
• Problem: It is seldom possible to collect data on all relevant people (or organizations or countries) that we hope to study.

Samples and Populations
• How can we calculate the mean or standard deviation for a population, without data on most individuals?
– Without even knowing the total N of the population?
• Are we stuck?
• IDEA: Maybe we can gain some understanding of large groups, even if we have information about only some of the cases within the group
– We can examine part of the group and try to make intelligent guesses about what the entire group is like.

Populations Defined
• Population: The entire set of persons, objects, or events that have at least one common characteristic of interest to a researcher (Knoke, p. 15)
• Populations (and things we’d like to study)
– Voting-age Americans (their political views)
– 6th grade students attending a particular school (their performance on a math test)
– People (their response to a new AIDS drug)
– Small companies (their business strategies).
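
Looking back at the normal-curve figures used in the cholesterol example (most cases within 3 S.D., a proportion of .000000287 at 5 or more S.D. above the mean, 1,722 people out of 6 billion), the following is a minimal sketch of how such areas can be computed. It is not part of the original slides. It assumes a standard normal curve and uses Python's `math.erf` and `math.erfc`; the function names are hypothetical.

```python
import math


def share_within(z):
    """Proportion of a normal curve lying within z standard deviations of the mean."""
    return math.erf(z / math.sqrt(2))


def share_above(z):
    """Proportion of a normal curve lying more than z standard deviations above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Coverage within 1, 2, and 3 standard deviations of the mean.
for z in (1, 2, 3):
    print(z, round(share_within(z), 4))

# Upper tail at 5 standard deviations, and the implied count in 6 billion people.
tail = share_above(5)
print(tail)
print(round(6e9 * tail))
```

With the unrounded tail area the count comes out near 1,720; the figure of 1,722 follows from multiplying the rounded proportion .000000287 by 6 billion.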

Population: Defined
• People in those populations have one common characteristic, even if they are different in many other ways
– Example: Voting-age Americans may differ wildly, but they share the fact that they are voting-age Americans
• Beyond the literal definition, a population is the general group that we wish to study and gain insight into.

Sample: Defined
• Sample: A subset of a population
– Any subset, chosen in any way
– But, manner of choosing makes some samples more useful than others
– Datasets are usually samples of a larger population
• Beyond the literal definition, sample often means “the group that we have data on”.

Statistical Inference: Defined
• Our goal: to describe populations
• However, we only have data on a sample (a subset) of the population
• We hope that studying a sample will give us some insight into the overall population
• Statistical Inference: making statistical generalizations about a population from evidence contained in a sample (Knoke, 77).

Statistical Inference
• When is statistical inference likely to work?
• 1. When a sample is large
– If a sample approaches the size of the population, it is likely to be a good reflection of that population
• 2. When a sample is representative of the entire population
– As opposed to a sample that is atypical in some way, and thus not reflective of the larger group.

Random Samples
• One way to get a representative sample is by choosing one randomly
• Definition: A sample chosen from a population such that each observation has an equal chance of being selected (Knoke, p. 77)
– Probability of selection:
$$p(\text{selection}) = \frac{1}{N}$$
• Randomness is one strategy to avoid “bias”, the circumstance when a sample is not representative of the larger population.

Biased Samples: Examples
• Biased samples can lead to false conclusions about characteristics of populations
• What are the problems with these samples?
– Internet survey asking people the number of CDs they own (population = all Americans)
– Telephone survey of political opinions conducted during the day (pop = voting-age Americans)
– Survey of an Intro Psych class on causes of stress and anxiety (pop = all humans)
– Survey of Fortune 500 firms on reasons that firms succeed (pop = all companies).

Statistical Inference
• Statistical inference involves two tasks:
• 1. Using information from a sample to estimate properties of the population
• 2. Using laws of statistics and information from the sample to determine how close our estimate is likely to be
– We can determine whether or not we are confident in our assessment of a population

Statistical Inference Example
• Population: Students in the United States
• Sample: Individuals in this classroom
• Question: What is the mean number of CDs owned by students in the US?
– Goal #1: Use information on students in this class to guess the mean number of CDs owned by students in the US
– Goal #2: Try to determine how close (or far off) our estimate of the population mean might be. Estimate the quality of the guess.
• Goal #2 helps prevent us from drawing inappropriate conclusions from Goal #1

Population and Sample Notation
• Characteristics of populations are called parameters
• Characteristics of a sample are called statistics
• To keep things straight, mathematicians use Greek letters to refer to populations and Roman letters to refer to samples
– Mean of sample is: Y-bar
– Mean of population is Greek mu: μ
– Standard deviation of sample is: s
– Standard deviation of a population is lower case Greek sigma: σ

Population and Sample Notation
• An estimate of a population parameter based on information from a sample is called a “point estimate”
– Example of a point estimate: Based on this sample, I’d guess that the mean # of CDs owned by students in the U.S. is 47.
• Formulas to estimate a population parameter from a sample are “estimators”

Estimation: Notation
• We often wish to estimate population parameters, using information from a sample we have
• We may use a variety of formulas to do this
• Mathematicians identify estimates of population parameters in formulas by placing a caret (“^”) over the parameter
– The caret is called a “hat”
– An estimate of σ is called “sigma-hat”
– Symbol: σ̂

Population and Sample Distributions
[Figure: population and sample distributions, labeled with Y-bar and s]

Populations and Samples
• Population parameters (μ, σ) are constants
– There is one true value, but it is unknown
• Sample statistics (Y-bar, s) are variables
– Up until now we’ve treated them as constants
– There are many possible samples, and thus many possible values for each
– In fact, the range of possible values makes up a distribution: the “sampling distribution”
• This provides insight into the probable location of the population mean
– Even if you only have one single sample to look at
– This “trick” lets us draw conclusions!
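
To make the idea of a sampling distribution concrete, here is a minimal simulation sketch (not from the original lecture). It assumes a made-up population of CD counts, draws many simple random samples, and records each sample's Y-bar. All names, the population size, the sample size, the number of samples, and the random seed are hypothetical choices for illustration.

```python
import random
import statistics

random.seed(5811)  # arbitrary seed so the sketch is repeatable

# Hypothetical population: number of CDs owned by 100,000 students.
population = [random.randint(0, 100) for _ in range(100_000)]
mu = statistics.mean(population)        # population mean (a constant)
sigma = statistics.pstdev(population)   # population standard deviation

# Draw many random samples; each one yields a different Y-bar.
sample_size = 50
sample_means = []
for _ in range(2_000):
    sample = random.sample(population, sample_size)  # simple random sample
    sample_means.append(statistics.mean(sample))

# The collection of Y-bar values approximates the sampling distribution of the mean.
print("population mean (mu):", round(mu, 2))
print("mean of sample means:", round(statistics.mean(sample_means), 2))
print("S.D. of sample means:", round(statistics.stdev(sample_means), 2))
print("population S.D. (sigma):", round(sigma, 2))
```

In a run of this sketch, the sample means should cluster around μ and vary much less than individual values do, which is why even a single sample carries information about where the population mean probably lies.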