Lecture 4: Confidence intervals, case selection, T
SPS 580 Lecture 4 CI Case selection outliers T

I. PRECISION OF DATA -- The “plus or minus”

Research is all about analyzing the mean or the percentage. When you take a random sample for a survey, how much confidence do you have in the mean, or the percentage? The level of confidence is expressed as the “plus or minus” that goes along with the result, also called the “margin of error”.

Illustration: I designed a questionnaire with a scale of 0-10 to determine whether the students in this class agree or disagree with <something really important>. A score of 5 is neutral; above that = favorable opinion; below that = unfavorable opinion. I’d like to know the average score in the class (aka the TRUE MEAN).

Universe = the 26 students in SPS 580. “Scale Score” shows what each person would say if they were asked the question.

    ID   NAME                   SCALE SCORE
     1   JACOB ADAMS                3.0
     2   CYNTHIA AVERY              6.0
     3   MARK BIERY                 3.0
     4   ABRAHAM BOL                9.0
     5   PRASID DHITAL              0.0
     6   PAUL DOMBROSKI             5.0
     7   KATHRYN DUVAL              1.0
     8   BROOKE EISENMENGER         0.0
     9   MOLLY FANNIN               3.0
    10   CRYSTAL GARDNER            2.0
    11   SIMONE GOURGUECHON         2.0
    12   SHAWN JANZEN               2.0
    13   AMANDA MAHONEY             1.0
    14   CLAIRE MARCH               7.0
    15   ELLEN MCELLIGOTT           6.0
    16   WILMAR MOLINA              2.0
    17   MEGAN MORROW               1.0
    18   KATERI NELIS               5.0
    19   JOANNA OUBRE               4.0
    20   PATRICK POUNTNEY           6.0
    21   AMANDA RIORDAN             9.0
    22   ANNE SAWKIW               10.0
    23   REBECCA WILSON             4.0
    24   JONATHAN WITTIG            0.0
    25   JADA WOLLENZIEN            3.0
    26   EDWARD ZEHME               0.0

There is a “TRUE MEAN” -- I don’t know what it is; that’s why I’m doing the survey. I didn’t have enough money to survey everyone in the course (i.e., conduct a census). I only have enough money to survey 4 people.

So I randomly selected 4 people and interviewed them. These are the answers I have for my survey:

    SAMPLE 1               SCALE SCORE
    SHAWN JANZEN               2.0
    CLAIRE MARCH               7.0
    REBECCA WILSON             4.0
    EDWARD ZEHME               0.0

    Average                    3.25
    variance (N-1)             8.92
    N                          4
    SEM                        1.49
    +/-                        2.93

And here are the results from my data analysis: the observed mean is 3.25.

STATISTICS ALERT . . .
MEAN = Sum(x) / n = 13/4 = 3.25

The 95% confidence interval equals “Observed Mean” +/- 2.93, which means I am 95% certain that the TRUE MEAN, i.e., the average score for all students in the class (for the UNIVERSE), is between 0.32 (low) and 6.18 (high).

STATISTICS ALERT . . .

    95% Confidence Interval = +/- 1.96 * Standard Error of the Mean = +/- 1.96 * 1.49 = +/- 2.93
    Standard Error of the Mean = Square Root( Variance / n ) = Sqrt(8.92/4) = 1.49
    Variance = Sum of ( (individual score – MEAN)^2 ) / (n-1) = Sum( (x – 3.25)^2 ) / 3 = 8.92

    x       (x-mean)    (x-mean)^2
    2.0      -1.25        1.5625
    7.0       3.75       14.0625
    4.0       0.75        0.5625
    0.0      -3.25       10.5625
    SUM                  26.75
    SUM/(n-1)             8.92

The variance has to do with the amount of VARIETY in the scores; it bounces around the same value regardless of how many people you interview.
The standard error of the mean has to do with the variance and the SAMPLE SIZE; it gets smaller as the sample gets larger.

A. The meaning of the 95% confidence interval . . .

The 95% CI is a way of saying we are 95% certain that the “REAL MEAN”, i.e., the one we would get if we surveyed everybody, is within the interval “Observed Mean” +/- 1.96 * SEM.

    WHERE IS THE "REAL MEAN"? (scale runs from 0 to 10)
    Observed mean = 3.25
    95% CI = 0.32 ………………………..6.18

95% of what? Well, if we did 100 surveys with the same sample size, then 95% of the time, i.e., 95 times out of 100, the 95% confidence interval will contain the “TRUE MEAN”.

To test this, I did four more surveys, each based on a random sample of the same size from the same universe. These are the results of surveys 2, 3, 4, and 5.

    SAMPLE 2               Scores
    SIMONE GOURGUECHON      2.0        Avg   2.75
    AMANDA MAHONEY          1.0        SEM   0.85
    KATERI NELIS            5.0        +/-   1.67
    JADA WOLLENZIEN         3.0

    SAMPLE 3               Scores
    CRYSTAL GARDNER         2.0        Avg   2.50
    CLAIRE MARCH            7.0        SEM   1.55
    MEGAN MORROW            1.0        +/-   3.05
    JONATHAN WITTIG         0.0

    SAMPLE 4               Scores
    ABRAHAM BOL             9.0        Avg   5.25
    PRASID DHITAL           0.0        SEM   2.25
    MOLLY FANNIN            3.0        +/-   4.41
    AMANDA RIORDAN          9.0

    SAMPLE 5               Scores
    JACOB ADAMS             3.0        Avg   4.75
    PAUL DOMBROSKI          5.0        SEM   0.85
    CLAIRE MARCH            7.0        +/-   1.67
    REBECCA WILSON          4.0

Here are the mean and 95% CI for each of the 5 samples . . .

    Mean and 95% CI for 5 samples from same Universe
                 95% Low    Mean    95% High
    Sample 1       0.32     3.25      6.18
    Sample 2       1.08     2.75      4.42
    Sample 3      -0.55     2.50      5.55
    Sample 4       0.84     5.25      9.66
    Sample 5       3.08     4.75      6.42

[Chart: 95% Confidence Interval for Sample 1 through Sample 5, showing 95% Low, Mean, and 95% High for each sample, with TRUE MEAN = 3.60 marked]

From the TOTAL data base we calculate that the “TRUE MEAN” is 3.60. However, in a research setting you don’t know this; you just have an observed sample mean and a 95% confidence interval.

In my 5 samples the 95% CI included the “TRUE MEAN” every time. If I had done 100 samples, I would expect the 95% CI to include the true mean about 95 times.

B. Interpreting the 95% confidence interval

In my survey I found out that the “True Opinion” is likely (95%) to be between 0.32 and 6.18.

    WHERE IS THE "REAL MEAN"? (scale runs from 0 to 10)
    Observed mean = 3.25
    95% CI = 0.32 ………………………..6.18

Q: So how helpful was my survey?
A: Not very. Another way to say it is that I’m 95% certain that the “True opinion” is either negative (< 5), neutral (5), or positive (> 5).
Q: How could I do a survey that is more helpful?
A: Increase the sample size.

II. PRECISION OF PERCENTAGES

When Y is a dichotomy coded (0,1), the mean is the proportion in category 1. Don’t believe me . . .
Add up these 10 responses and divide by 10 to get the average (mean):

    person    1   2   3   4   5   6   7   8   9   10
    score     0   1   1   1   1   0   1   0   0   1

    sum of scores                      6
    n                                  10
    average = proportion(1) = p        0.60     60.0%    mean = proportion coded (1); also can be expressed as % coded (1)
    variance = p(1-p)                  0.24              formula for variance is simpler
    SEM = sqrt( variance / n )         0.155             formula for SEM is THE SAME
    +/- = 1.96 * SEM                   0.304    30.4%    formula for +/- is THE SAME
    95% CI = p +/- 1.96 * SEM                            formula for 95% CI is THE SAME
        low                            0.296    29.6%
        p                              0.600    60.0%
        high                           0.904    90.4%

    WHERE IS THE "REAL MEAN"? (scale runs from 0% to 100%)
    Observed mean = 60.0%
    95% CI = 29.6% …………..………..90.4%

The interpretation of results is THE SAME.

Q: Is the majority opinion in the class above or below 50%?
A: I don’t know, but I’m sure it’s between 29.6% and 90.4% !!!
Q: What can you do to make this more precise?
A: LOOK AT ASSGT 4 Part 1

III. EXPLORING Confidence Intervals with Live Data

WBEZ marketing committee wants to know how to increase revenue from its younger audience:
    Listenership, familiarity w/WBEZ
    Membership in NFPs
    Usual payment for membership

A. What % listen to WBEZ radio station?

    FREQUENCIES VARIABLES= wbezrng
      /ORDER=ANALYSIS.

    wbezrng  Amount of Time Listening to WBEZ
                                            Frequency
    Valid     1 Know It, Don't Listen           408
              2 Listen < 1 hr/day               462
              3 Listen > 1hr/day                299
              4 Not Familiar                   1809
              5 Don't Listen to Radio           247
              8 Don't Know                       16
              Total                            3241
    Missing   System                          33449
    Total                                     36690

    WBEZ familiarity, listenership, adult population
    Don’t listen to radio              247
    Not familiar with WBEZ           1,809
    Familiar, don't/DK listen          424
    Listen to WBEZ                     761
    Total                            3,241

A. NOTE: the marketing committee wants to target its research on the population 45 and under.

1. Define a selection variable . . .

    RECODE age01 (10 thru 45=1) (46 thru 98=2) (ELSE=9) INTO AGE2.
    VARIABLE LABELS AGE2 'age2 '.
    VALUE LABELS age2 1 '18- 45' 2 '46+' 9 'not valid'.
    MISSING VALUES age2 (9) .

2. Select that data only for analysis . . .
    DATA / SELECT CASES / IF CONDITION IS SATISFIED / IF age2=1 / OK/PASTE

Other selection variables mentioned in SPS570/580 . . . transit riders, low income
Univariate selection variables (for now) . . . typology construction later

    USE ALL.
    COMPUTE filter_$=(AGE2 = 1).
    VARIABLE LABEL filter_$ 'AGE2 = 1 (FILTER)'.
    VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
    FORMAT filter_$ (f1.0).
    FILTER BY filter_$.

NEW DATA FOR AGE 18-45 ONLY

    WBEZ familiarity, listenership, population 18-45
                                         N        %      +/-
    Don’t listen to radio               110
    Not familiar with WBEZ            1,125
    Familiar, don't/DK listen           261      13%     1.5%    Potential listeners
    Listen to WBEZ                      447      23%     1.9%    Current listeners
    Total                             1,943
    Potential + current listeners                36%     2.1%

B. What % belong to non-profit arts or cultural organizations?

    RECODE mems1 mems10 mems11 mems12 mems13 mems2 mems3 mems4 mems5 mems6 mems7 mems8 mems9 (1=1) (2=0) (8=0).
    VALUE LABELS mems1 mems10 mems11 mems12 mems13 mems2 mems3 mems4 mems5 mems6 mems7 mems8 mems9 0 ' no DK ' 1 ' yes member'.
    COMPUTE ARTCULTMEMBERSHIPS = mems1+ mems2+ mems3+ mems4+ mems5 + mems6 + mems7 + mems8 + mems9+ mems10+ mems11+ mems12+ mems13.

KEEP THE SELECTION VARIABLE OPERATING

    FREQUENCIES VARIABLES=ARTCULTMEMBERSHIPS
      /ORDER=ANALYSIS.

    ARTCULTMEMBERSHIPS
                     Frequency    Valid Percent
    Valid    .00        1434          81.8
             1.00        245          14.0
             2.00         57           3.2
             3.00         13            .7
             4.00          4            .2
             6.00          1            .1
             Total      1754         100.0

Inspection => (0,1) utility only
One or more memberships (1 – 6) = 18% +/- 1.8%

Note: you can get different data points on the same population even though memberships and WBEZ listenership are not asked the same year.

C. How much are people willing to pay for a membership in an arts/cultural organization?

MEMSPED: How much would you be willing to spend for a one-year family membership in one of these types of organizations?
    memsped  How Much for One-Year Family Membership
    dollars   frequency      dollars   frequency      dollars   frequency
    $0           174         $65            7         $225           1
    $1             4         $67            1         $250           5
    $2             1         $68            1         $300          11
    $5            24         $70           10         $350           2
    $7             1         $75           43         $360           1
    $10           35         $80            4         $400           4
    $15           13         $85            2         $500          10
    $20           93         $89            1         $501           1
    $25          106         $90            1         $745           1
    $29            1         $100         251         $1,000         4
    $30           54         $110           1         $1,200         1
    $35           25         $115           1         $1,500         1
    $40           41         $120           4         $2,000         2
    $45            7         $125           3         $5,000         1
    $50          346         $150          26         $5,200         1
    $55            4         $175           1         $6,000         1
    $60           33         $200          58         9998 DK      324

MEAN = $82 +/- $14

Anybody see any problems here?
Anything above $500 is quite possibly a mistake.
Anything above $200 is an OUTLIER.
Rule of thumb: values more than 150% distance from the mean are going to cause trouble.

D. Deal with OUTLIERS one of two ways

a. Work with the median (as opposed to the mean) . . . Between $45 and $50 -- where the 50th percentile falls
b. But the median can’t be used for many statistical procedures
c. So we RECODE OUTLIERS to an acceptable maximum value and then re-calculate the mean

    RECODE memsped (225 thru 6000=200) (ELSE=Copy) INTO memsped2.
    VARIABLE LABELS memsped2 'money for membership'.
    MISSING VALUES MEMSPED2 (9997, 9998) .
    FREQUENCIES VARIABLES= MEMSPED2
      /ORDER=ANALYSIS.
    DESCRIPTIVES VARIABLES=MEMSPED2
      /STATISTICS=MEAN STDDEV VARIANCE MIN MAX SEMEAN.

    How Much for One-Year Family Membership w/o OUTLIERS
    dollars   frequency      dollars   frequency
    $0           174         $65            7
    $1             4         $67            1
    $2             1         $68            1
    $5            24         $70           10
    $7             1         $75           43
    $10           35         $80            4
    $15           13         $85            2
    $20           93         $89            1
    $25          106         $90            1
    $29            1         $100         251
    $30           54         $110           1
    $35           25         $115           1
    $40           41         $120           4
    $45            7         $125           3
    $50          346         $150          26
    $55            4         $175           1
    $60           33         $200         105

MEAN = $60 +/- $3

Look at the impact of trimming the outliers: $82 vs. $60.

Rule of thumb: the reason you trim outliers is that means (and lots of other really important statistics) are VERY STRONGLY INFLUENCED by extreme values. You get a more stable, and therefore more accurate, picture of the REAL WORLD by trimming (not eliminating) outliers.
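The effect of recoding the outliers can be checked outside of SPSS. The following is only a sketch in Python, not part of the lecture's PASW workflow. It rebuilds the memsped frequency table from the listing above, caps every value at $200, and recomputes the mean and the 1.96 * SEM margin before and after. The helper names mean_and_margin and cap are hypothetical, and the count of 1 for $225 is an assumption inferred from the capped table (the $200 row rises from 58 to 105).

```python
import math

# Frequency table copied from the memsped listing: {dollars: number of respondents}.
# ASSUMPTION: the count for $225 is taken to be 1, inferred from the capped
# table, where the $200 row grows from 58 to 105. DK (9998) is treated as missing.
FREQ = {
    0: 174, 1: 4, 2: 1, 5: 24, 7: 1, 10: 35, 15: 13, 20: 93, 25: 106,
    29: 1, 30: 54, 35: 25, 40: 41, 45: 7, 50: 346, 55: 4, 60: 33,
    65: 7, 67: 1, 68: 1, 70: 10, 75: 43, 80: 4, 85: 2, 89: 1, 90: 1,
    100: 251, 110: 1, 115: 1, 120: 4, 125: 3, 150: 26, 175: 1, 200: 58,
    225: 1, 250: 5, 300: 11, 350: 2, 360: 1, 400: 4, 500: 10, 501: 1,
    745: 1, 1000: 4, 1200: 1, 1500: 1, 2000: 2, 5000: 1, 5200: 1, 6000: 1,
}


def mean_and_margin(freq):
    """Return (mean, 1.96 * SEM) for a {value: count} frequency table."""
    n = sum(freq.values())
    mean = sum(value * count for value, count in freq.items()) / n
    # Variance with the (n-1) denominator, as in the lecture's formula.
    variance = sum(count * (value - mean) ** 2 for value, count in freq.items()) / (n - 1)
    sem = math.sqrt(variance / n)
    return mean, 1.96 * sem


def cap(freq, maximum):
    """Recode every value above `maximum` to `maximum` (same idea as the RECODE step)."""
    capped = {}
    for value, count in freq.items():
        key = min(value, maximum)
        capped[key] = capped.get(key, 0) + count
    return capped


raw_mean, raw_margin = mean_and_margin(FREQ)
capped = cap(FREQ, 200)
capped_mean, capped_margin = mean_and_margin(capped)

print(f"raw:    mean = ${raw_mean:.2f}  +/- ${raw_margin:.2f}")
print(f"capped: mean = ${capped_mean:.2f}  +/- ${capped_margin:.2f}")
```

Capping rather than dropping the 47 high responses keeps every respondent in the sample, which is the "trimming (not eliminating)" point made above.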
IV. Should WBEZ market differently in the city vs. suburbs?

    RECODE region (1=1) (2 thru 7=2) (ELSE=9) INTO region2.
    VARIABLE LABELS region2 'region recoded'.
    VALUE LABELS region2 1 'Chicago' 2 'Suburbs'.
    MISSING VALUES region2 (9).

KEEP THE SELECTION VARIABLE OPERATING

A. Is WBEZ listenership higher in the city or in the suburbs?

                  1. Don’t listen   2. Not familiar   3. Familiar,       4. Listen to
                  to radio          with WBEZ         don't/DK listen    WBEZ            Total
    1 Chicago          11%               49%               12%               29%            766
    0 Suburbs           2%               64%               15%               19%          1,177
    Total               6%               58%               13%               23%          1,943

    Chi Sq(3) = 98     Phi = .22

But chi square and phi are blanket tests; WBEZ wants to know specifically about listenership.

    Does Place Predict WBEZ Listenership?
    Place          Mean    Std Err    +/-       CI(Low)   Mean    CI(High)
    1 Chicago      0.29    0.0163     0.0320     0.25     0.29     0.32
    0 Suburbs      0.19    0.0115     0.0226     0.17     0.19     0.22
    Difference     0.09

What is the CONFIDENCE INTERVAL for the difference of means? If it includes the value ZERO, then the difference of the means is NOT SIGNIFICANT.

STATISTICAL THEORY ALERT . . .
    Mean 1 has its uncertainty (SEM1)
    Mean 2 has its uncertainty (SEM2)
    Logical conclusion: wouldn’t it make sense that the uncertainty of the difference is equal to the sum of the two uncertainties? Well it is, sort of . . .

STATISTICS ALERT
    STD ERROR of DIFFERENCE OF 2 Means = SQRT ( SEM1^2 + SEM2^2 )
    95% CONFIDENCE INTERVAL for the difference of means = Difference +/- 1.96 * SEDiff
    T-Test = DIFFERENCE of Means / STD ERROR of DIFFERENCE

                   CI(Low)    Mean    CI(High)
    1 Chicago        25%      29%       32%
    0 Suburbs        17%      19%       22%
    Difference        5%       9%       13%

The CI(Diff) does NOT include ZERO, so we conclude that there is a SIGNIFICANT DIFFERENCE in listenership by place . . . In the city, listenership is about 9 percentage points higher than in the suburbs (10 points if computed from the rounded means, 29% vs. 19%).

Another way to look at this is that the difference is significant if the t-test > 1.96 (or < -1.96 for negative differences), df = INFINITE.

E. Is the percent who belong to nfp arts/cultural organizations higher in the city?

KEEP THE SELECTION VARIABLE OPERATING

    region2 region recoded * ARTCULTMEMBERSHIPS Crosstabulation
    % within region2 region recoded
                           ARTCULTMEMBERSHIPS
                    .00      1.00     2.00     3.00     4.00     6.00     Total
    .00 Suburbs    80.5%    15.2%     3.5%      .4%      .3%              100.0%
    1.00 Chicago   83.2%    12.5%     3.0%     1.1%      .1%      .1%     100.0%
    Total          81.8%    14.0%     3.2%      .7%      .2%      .1%     100.0%

    Chi square (5) = 7.7     phi = .077     p > .05

Grouping together to focus on any memberships vs. none:

                          Any memberships      N       p*(1-p)/n
    1 Chicago                  0.17           808      0.0001733
    0 Suburbs                  0.19           946      0.0001656
                                              sum -->  0.0003389
                                              sqrt --> 0.0184083
    Difference                -0.03
    SE(Diff)                   0.02
    T                         -1.42
    critical value of T       -1.96
    Not significant

Not Sig, so the answer is NO: the % who belong is not significantly different in the city and in the suburbs.

F. Is the amount people are willing to pay for a membership higher in the suburbs?

    ANALYZE / COMPARE MEANS / MEANS / Dependent Memsped2 / Independent Region2 / OPTIONS Mean, Std Error of Mean

    Report
    memsped2 money for membership
    region2 region recoded      Mean       Std. Error of Mean
    .00 Suburbs                59.8833         1.81970
    1.00 Chicago               60.3991         2.16942
    Total                      60.1166         1.39812

                       Avg. spend for memberships     SEM         SEM^2
    .00 Suburbs                 $59.88               $1.8197      3.3113
    1.00 Chicago                $60.39               $2.1694      4.7064
                                                     SUM -->      8.0177
                                                     SQRT -->    $2.8316
    Difference                  -$0.51
    SE Difference                2.8316
    T                           -0.18
    critical value              -1.96
    Not significant

Not Sig, so the answer is NO: there is no significant difference between what people are willing to pay in the suburbs and in the city.

Assignment 4:

Part 1: The Excel spreadsheet for Assignment 4 contains a list of the students in SPS 580 and their opinions on two really important issues. Opinion Item 2 is measured on a (1,10) scale. Opinion Item 3 is measured as a (0,1) dichotomy.

1. Randomly select 10 people from the list; analyze the scores for the answers they gave to Opinion Item 2 and Opinion Item 3.
2. For Opinion Item #2: What is the observed mean, the variance, the SEM, the 95% CI? What do you conclude from your survey?
3. For Opinion Item #3: ditto

Part 2: Define a policy research problem on a TARGETED POPULATION using a univariate selection variable (recoded, but no typologies)

1. TARGET POPULATION: Use PASW to select the targeted population; describe how this is done.
2. DEPENDENT VARIABLES . . . Define one dichotomous (0,1) outcome variable (Y1), and one interval scale outcome variable (Y2) -- can be a scale you compute or an interval variable on the data set
    a. For Y1, what is the 95% Confidence Interval for the percent?
    b. For Y2 . . .
        i. Is there a need to trim outliers? Take the necessary action and explain it.
        ii. What are the low/high (trimmed) values, and what is the mean and the 95% Confidence Interval for the mean?
3. INDEPENDENT VARIABLE: Define a (0,1) dichotomous independent variable (X1) that classifies the target population according to a characteristic of policy interest; explain the variable and categories.
    a. What is the theory being tested for X1 -> Y1?
    b. Crosstabulate X1 and Y1; show a PQ table of percents, with added columns/rows as needed to show the steps in calculating a T-test for the difference in percentages.
    c. What do you conclude from the data and the T-test?
    d. What is the theory being tested for X1 -> Y2?
    e. Calculate a table of means for X1 and Y2; show a PQ table of means, with added columns/rows as needed to show the steps in calculating a T-test for the difference in means.
    f. What do you conclude from the data and the T-test?
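For the T-test steps that Part 2, item 3 asks you to show, the arithmetic from Section F can be laid out as a short Python sketch. This is an illustration only, since the course itself works in PASW and Excel. The function name t_for_difference is hypothetical. The inputs are the means and standard errors from the Section F report, and 1.96 is the critical value the lecture uses (df = INFINITE).

```python
import math


def t_for_difference(mean1, sem1, mean2, sem2, critical=1.96):
    """T-test for a difference of two means, following the lecture's steps.

    SE(diff) = sqrt(SEM1^2 + SEM2^2), T = difference / SE(diff),
    95% CI for the difference = difference +/- critical * SE(diff).
    """
    diff = mean1 - mean2
    se_diff = math.sqrt(sem1 ** 2 + sem2 ** 2)
    t = diff / se_diff
    margin = critical * se_diff
    return diff, se_diff, t, (diff - margin, diff + margin)


# Section F figures: Suburbs (mean 59.8833, SEM 1.81970)
# vs. Chicago (mean 60.3991, SEM 2.16942).
diff, se_diff, t, ci = t_for_difference(59.8833, 1.81970, 60.3991, 2.16942)

# Significant only if |T| exceeds the critical value, which is the same
# as saying the CI for the difference does not include zero.
significant = abs(t) > 1.96

print(f"difference = {diff:.2f}, SE(diff) = {se_diff:.4f}, T = {t:.2f}")
print(f"95% CI for the difference: {ci[0]:.2f} to {ci[1]:.2f}")
print("significant" if significant else "not significant")
```

The same function works for a difference in percentages if each SEM is computed as sqrt(p*(1-p)/n), as in Section E.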