Social Statistics: Are your curves normal?

This week
Why is understanding probability important?
What is the normal curve?
How to compute and interpret z scores

What is probability?
The chance of winning a lottery
The chance of getting a head on one flip of a coin
It determines the degree of confidence with which we can state a finding

Normal distribution
Percentages under the normal curve
Almost 100% of the scores fall between -3 SD and +3 SD
Around 34% of the scores fall between the mean and +1 SD
Are all distributions normal?

Normal distribution
The distance between    contains               Range (if mean=100, SD=10)
Mean and +1 SD          34.13% of all cases    100-110
+1 SD and +2 SD         13.59% of all cases    110-120
+2 SD and +3 SD         2.15% of all cases     120-130
Above +3 SD             0.13% of all cases     >130
Mean and -1 SD          34.13% of all cases    90-100
-1 SD and -2 SD         13.59% of all cases    80-90
-2 SD and -3 SD         2.15% of all cases     70-80
Below -3 SD             0.13% of all cases     <70

Z score: standard score
Use z scores when you want to compare individuals in different distributions
Z scores are comparable because they are standardized in units of standard deviations

Z score
Standard score: $z = \frac{X - \bar{X}}{s}$
$X$: the individual score
$\bar{X}$: the mean
$s$: the standard deviation
Sample or population?

Standard Normal Distribution
Mean=0, standard deviation=1

Z score
What are the mean and SD of the z distribution?
Mean=25, SD=2: what is the z score for 23, 27, 30?

Z score
Z scores across different distributions are comparable
Z scores express distance from the mean in the same unit of measurement (standard deviations)
Raw score 12.8 (mean=12, SD=2): z=+0.4
Raw score 64 (mean=58, SD=15): z=+0.4
Equal distances from the mean

Comparing apples and oranges
Eric competes in two track events: standing long jump and javelin. His long jump was 49 inches, and his javelin throw was 92 ft. He then measures all the other competitors in both events and calculates the mean and standard deviation:
Javelin: M = 86 ft, s = 10 ft
Long jump: M = 44 in, s = 4 in
Which event did Eric do best in?
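A minimal Python sketch of the z-score formula, applied to the numbers on the slides so far. The helper name z_score and the variable names are illustrative choices, not part of the course materials.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Exercise: mean=25, SD=2, scores 23, 27, 30
exercise = {x: z_score(x, 25, 2) for x in (23, 27, 30)}

# Two raw scores from different distributions that sit equally far from their means
z_a = z_score(12.8, 12, 2)
z_b = z_score(64, 58, 15)

# Eric's two events: the larger z score is the better performance
# relative to the other competitors
z_long_jump = z_score(49, 44, 4)
z_javelin = z_score(92, 86, 10)
best_event = "long jump" if z_long_jump > z_javelin else "javelin"

print(exercise)
print(round(z_a, 2), round(z_b, 2))
print(z_long_jump, z_javelin, best_event)
```

Because both events are converted to standard deviation units, inches and feet can be compared directly.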
Excel for z score
STANDARDIZE(x, mean, standard deviation)
(x-AVERAGE(array))/STDEV(array)

What do z scores represent?
Raw scores below the mean have negative z scores
Raw scores above the mean have positive z scores
A z score represents the number of standard deviations from the mean
The more extreme the z score, the further the raw score is from the mean

What do z scores represent?
84% of all the scores fall below a z score of +1 (why?)
16% of all the scores fall above a z score of +1 (why?)
This percentage represents the probability of a certain score occurring, or an event happening
If it is less than 5%, then this event is unlikely to happen

Exercise
In a normal distribution with a mean of 100 and a standard deviation of 10, what is the probability that any one score will be 110 or above?
What about 6σ? http://en.wikipedia.org/wiki/Six_Sigma

NORM.DIST()
NORM.DIST(z,mean,standard_dev,cumulative)
z: The value for which you want the distribution (a z score when mean=0 and standard_dev=1).
mean: The arithmetic mean of the distribution.
standard_dev: The standard deviation of the distribution.
cumulative: A logical value that determines the form of the function. If cumulative is TRUE, NORM.DIST returns the cumulative distribution function; if FALSE, it returns the probability density function (the height of the normal curve at that value, since the normal distribution is continuous).

Exercise
The probability associated with z=1.38
41.62% of all the cases in the distribution fall between the mean and 1.38 standard deviations above it
About 92% fall below 1.38 standard deviations above the mean
How and why?

Between two z scores
What is the probability of falling between a z score of 1.5 and a z score of 2.5?
Z=1.5: 43.32% of cases lie between the mean and this z score
Z=2.5: 49.38% of cases lie between the mean and this z score
So around 6% (49.38% - 43.32%) of all the cases in the distribution fall between 1.5 and 2.5 standard deviations above the mean.
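The areas quoted on these slides can be checked with a short Python sketch. It assumes Python's standard-library statistics.NormalDist as a stand-in for Excel's NORM.DIST with cumulative=TRUE; the helper names area_below and area_between are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
std_normal = NormalDist(mu=0, sigma=1)


def area_below(z):
    """Proportion of cases falling below the given z score."""
    return std_normal.cdf(z)


def area_between(z_low, z_high):
    """Proportion of cases falling between two z scores."""
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)


below_1 = area_below(1.0)            # about 0.84
above_1 = 1 - below_1                # about 0.16
mean_to_138 = area_between(0, 1.38)  # about 0.4162
below_138 = area_below(1.38)         # about 0.92
between = area_between(1.5, 2.5)     # about 0.06

# Exercise: probability of a score of 110 or above when mean=100, SD=10
p_110_or_above = 1 - NormalDist(mu=100, sigma=10).cdf(110)

for name, value in [("below z=1", below_1), ("above z=1", above_1),
                    ("mean to z=1.38", mean_to_138), ("below z=1.38", below_138),
                    ("between z=1.5 and z=2.5", between),
                    ("score >= 110", p_110_or_above)]:
    print(f"{name}: {value:.4f}")
```

The 84% figure is the 50% of cases below the mean plus the 34.13% between the mean and +1 SD, which is why a score of 110 or above (z=+1) has a probability of about 16%.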
Exercise
What percentage of the data falls between 110 and 125 in a distribution with mean=100 and SD=10?

Exercise
What is the probability of a particular score occurring between a z score of +1 and a z score of +2.5?

Exercise
Compute the z scores for the following raw scores, where the mean=50 and the standard deviation=5:
55
50
60
57.5
46

Exercise
The math section of the SAT has μ = 500 and σ = 100. If you selected a person at random:
a) What is the probability he would have a score greater than 650?
b) What is the probability he would have a score between 400 and 500?
c) What is the probability he would have a score between 630 and 700?

Determine sample size
$\text{Sample size} = \frac{\text{Number of responses needed}}{\text{Expected response rate}}$
Expected response rate: obtained from historical data
Number of responses needed: calculated with a formula

Number of responses needed
$n = \frac{Z^2 \sigma_x^2}{e^2}$
n = number of responses needed (sample size)
Z = the number of standard deviations that describe the precision of the results
e = accuracy or the error of the results
$\sigma_x^2$ = variance of the data (for a large population size)

Deciding $\sigma_x^2$
Take it from previous surveys
Intentionally use a large number (a conservative estimate)
e.g. a 10-point scale: assume that responses will be found across the entire 10-point scale
3σ to the left and right of the mean describe virtually the entire area of the normal distribution curve
$\sigma_x = 10/6 = 1.67$; $\sigma_x^2 = 2.78$ (forcing the 10 points to lie within $-3\sigma$ and $+3\sigma$)

Example
$n = \frac{Z^2 \sigma_x^2}{e^2}$
Z = 1.96 (usually rounded to 2)
$\sigma_x^2 = 2.78$
e = 0.2
With Z rounded to 2: $n = \frac{2^2 \times 2.78}{0.2^2} = 278$ (responses needed)
Assume the response rate is 0.4
Sample size = 278/0.4 = 695

Exercise
$n = \frac{Z^2 \sigma_x^2}{e^2}$
Z = 1.96 (usually rounded to 2)
5-point scale (suppose most of the responses are distributed from 1 to 5)
Error tolerance = 0.4
Assume the response rate is 0.6
What is the sample size?

Sampling
How do we collect data so that conclusions based on our observations can be generalized to a larger group of observations?
Population: A group that includes all the cases (individuals, objects, or groups) in which the researcher is interested.
Sample: A subset of cases selected from a population.
Parameter: A measure (e.g., mean or standard deviation) used to describe the population distribution.
Statistic: A measure (e.g., mean or standard deviation) used to describe the sample distribution.

Probability sampling
A method of sampling that enables the researcher to specify, for each case in the population, the probability of its inclusion in the sample.
The purpose of probability sampling is to select a sample that is as representative as possible of the population.
It enables the researcher to estimate the extent to which the findings based on one sample are likely to differ from what would be found by studying the entire population.

Simple Random Sample
A sample designed in such a way as to ensure that 1) every member of the population has an equal chance of being chosen, and 2) every combination of N members has an equal chance of being chosen.
Example: Suppose we are conducting a cost-containment study of 10 hospitals in our region, and we want to draw a sample of two hospitals to study intensively.

Systematic random sampling
A method of sampling in which every Kth member of the population is chosen for inclusion in the sample. K is the ratio obtained by dividing the population size by the desired sample size.
Example: We have a population of 15,000 commuting students and our sample is limited to 500, so K = 15,000/500 = 30. We first choose one student at random from the first 30 students, then select every 30th student after that until we reach 500.

Stratified Random Sample
A method of sampling obtained by 1) dividing the population into subgroups based on one or more variables central to our analysis, and 2) then drawing a simple random sample from each of the subgroups.
Proportionate stratified sample: the size of the sample selected from each subgroup is proportional to the size of that subgroup in the entire population.

Disproportionate stratified sample
The size of the sample selected from each subgroup is deliberately made disproportional to the size of that subgroup in the population.
Example: a sample (N=180) with 90 whites (50%), 45 blacks (25%) and 45 Latinos (25%).

Sampling distribution
It helps estimate the likelihood of our sample statistics and enables us to generalize from the sample to the population.
But the population is unknown most of the time.
The sampling distribution is a theoretical probability distribution (which is never really observed) of all possible sample values for the statistic in which we are interested.

Sampling distribution
If we select 3 of them, how will the mean and standard deviation differ?

Sampling distribution of the mean
A theoretical probability distribution of sample means that would be obtained by drawing from the population all possible samples of the same size.
[Table: Mean Income of 50 Samples of Size 3 from 20 individuals]

Standard error of the mean
It describes how much variability there is in the value of the mean from sample to sample.
It equals the standard deviation of the population, $\sigma_Y$, divided by the square root of the sample size:
$\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{N}}$

Central Limit Theorem
If all possible random samples of size N are drawn from a population with a mean $\mu_Y$ and a standard deviation $\sigma_Y$, then as N becomes larger, the sampling distribution of sample means becomes approximately normal, with mean $\mu_{\bar{Y}}$ equal to the population mean $\mu_Y$ and a standard deviation equal to $\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{N}}$.
According to the central limit theorem, a large enough N (>50, or >30) means that the sampling distribution of the mean will be approximately normal.

Estimation
A process whereby we select a random sample from a population and use a sample statistic to estimate a population parameter.
Point estimate: A sample statistic used to estimate the exact value of a population parameter. A point estimate usually involves some sampling error, so it is less accurate.
Confidence interval (CI): A range of values, defined by the confidence level, within which the population parameter is estimated to fall. Confidence intervals are sometimes referred to as a margin of error.
Confidence level: The likelihood, expressed as a percentage or probability, that a specified interval will contain the population parameter.
Margin of error: The radius of a confidence interval.

Estimation
Confidence intervals are defined in terms of confidence levels.
At a 95% confidence level, there is a 0.95 probability (or 95 chances out of 100) that a specified interval will contain the population mean.
The most common confidence levels are 90%, 95%, and 99%.
The margin of error is the radius of a confidence interval.
So if we select a 95% confidence level, we would have a 5% chance of our interval being incorrect.

Notation
                                     Mean               Standard deviation
Sample distribution                  $\bar{Y}$          $S_Y$
Population distribution              $\mu_Y$            $\sigma_Y$
Sampling distribution of $\bar{Y}$   $\mu_{\bar{Y}}$    $\sigma_{\bar{Y}}$

$CI = \bar{Y} \pm Z(\sigma_{\bar{Y}})$
• A total of 68% of all random sample means will fall within ±1 standard error (standard deviation) of the true population mean. (Z=±1)
• A total of 95% of all random sample means will fall within ±1.96 standard errors (standard deviations) of the true population mean. (Z=±1.96)
• A total of 99% of all random sample means will fall within ±2.58 standard errors (standard deviations) of the true population mean. (Z=±2.58)

Determining the confidence interval
Follow these steps:
Calculate the standard error (standard deviation) of the mean
Decide on the level of confidence, and find the corresponding Z value
Calculate the confidence interval
Interpret the results

Example
To estimate the average commuting time of all 15,000 commuters on our campus (the population parameter), we survey a random sample of 500 students, and the sample mean ($\bar{Y}$) is 7.5 hrs/week.
Step 1: Calculate the standard error (standard deviation) of the mean
Let's suppose the standard deviation for the population is $\sigma_Y = 1.5$
$\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{N}} = \frac{1.5}{\sqrt{500}} = 0.07$

Step 2: Decide on the level of confidence, and find the corresponding Z value
Let's take the 95% confidence level, so Z=±1.96

Step 3: Calculate the confidence interval
$CI = \bar{Y} \pm Z(\sigma_{\bar{Y}}) = 7.5 \pm 1.96(0.07) = 7.5 \pm 0.14 = 7.36 \text{ to } 7.64$

Step 4: Interpret the results
We can be 95% confident that the actual mean commuting time (the true population mean) is no less than 7.36 and no greater than 7.64 hrs per week.
There is a 5% risk that we are wrong: if we collected a large number of samples (each with N=500), then for about five out of every 100 samples the true population mean would not be included in the specified interval.

Example
If we draw 10 different samples and compute a 95% confidence interval for each, only 1 of the 10 confidence intervals does not intersect the vertical line that marks the true population mean.

Estimating Sigma
Usually both the mean ($\mu_Y$) and the standard deviation ($\sigma_Y$) of the population are unknown.
When N is greater than or equal to 50, the sample standard deviation ($S_Y$) is a good estimate of the standard deviation of the population ($\sigma_Y$):
$\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{N}} \approx \frac{S_Y}{\sqrt{N}}$

Example
We will estimate the mean hours per day that Americans spend watching TV based on the 2010 GSS.
The mean hours per day spent watching TV for a sample of N=1013 is $\bar{Y} = 3.01$, and the standard deviation is $S_Y = 2.65$ hrs.
Let's use the 95% confidence interval.
$\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{N}} \approx \frac{S_Y}{\sqrt{N}} = \frac{2.65}{\sqrt{1013}} = 0.08$
The Z value for the 95% confidence interval is 1.96.
$CI = \bar{Y} \pm Z(\sigma_{\bar{Y}}) = 3.01 \pm 1.96(0.08) = 3.01 \pm 0.16 = 2.85 \text{ to } 3.17$
We are 95% confident that the actual mean hours per day spent watching TV in the American population from which the GSS sample was taken is not less than 2.85 and not greater than 3.17.
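The four steps can be sketched in Python for both examples. The function names standard_error and confidence_interval are illustrative, and the sketch keeps full precision, whereas the slides round the standard error to two decimals before multiplying by Z, so the commuting interval comes out very slightly narrower here.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: standard deviation divided by sqrt(N)."""
    return sd / math.sqrt(n)


def confidence_interval(sample_mean, sd, n, z=1.96):
    """Return (lower, upper) bounds of the interval mean +/- z * standard error."""
    margin = z * standard_error(sd, n)
    return sample_mean - margin, sample_mean + margin


# Commuting example: mean 7.5 hrs/week, population SD assumed to be 1.5, N=500
commute_low, commute_high = confidence_interval(7.5, 1.5, 500)

# TV example: mean 3.01 hrs/day, sample SD 2.65 used in place of sigma, N=1013
tv_low, tv_high = confidence_interval(3.01, 2.65, 1013)

print(f"Commuting: {commute_low:.2f} to {commute_high:.2f} hrs/week")
print(f"TV: {tv_low:.2f} to {tv_high:.2f} hrs/day")
```

Passing z=2.58 instead of the default gives the corresponding 99% interval.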
What affects confidence interval width
If other factors do not change:
If the sample size goes up, the width gets smaller
If the sample size goes down, the width gets bigger
If the value of the sample standard deviation goes up, the width gets bigger
If the value of the sample standard deviation goes down, the width gets smaller
If the level of confidence goes up (95% to 99%), the width gets bigger
If the level of confidence goes down (99% to 95%), the width gets smaller
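These relationships follow from the width of the interval, which is twice the margin of error, $2Z\frac{S_Y}{\sqrt{N}}$. A small Python sketch, using the TV example's numbers as a baseline (the function name ci_width and the comparison values are illustrative assumptions):

```python
import math


def ci_width(sd, n, z):
    """Full width of the confidence interval: 2 * z * sd / sqrt(n)."""
    return 2 * z * sd / math.sqrt(n)


# Baseline: SD=2.65, N=1013, 95% confidence (Z=1.96)
base = ci_width(sd=2.65, n=1013, z=1.96)

# Change one factor at a time, holding the others fixed
bigger_sample = ci_width(sd=2.65, n=4 * 1013, z=1.96)   # 4x the sample: half the width
larger_sd = ci_width(sd=2 * 2.65, n=1013, z=1.96)       # 2x the SD: double the width
higher_confidence = ci_width(sd=2.65, n=1013, z=2.58)   # 99% instead of 95%: wider

print(f"baseline:          {base:.3f}")
print(f"4x sample size:    {bigger_sample:.3f}")
print(f"2x std deviation:  {larger_sd:.3f}")
print(f"99% confidence:    {higher_confidence:.3f}")
```

Because N sits under a square root, quadrupling the sample size is needed to halve the width.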