Definition of Non-Parametric Statistics

Non-parametric statistics is a branch of statistics applied when populations are not normal or the data are severely skewed.

Titles of Non-parametric Tests
• One Sample Median Test
• Two Sample Location Test
• Two Sample Dispersion Test
• One-Way Layout
• Independence Test

Focus: Median tests
This presentation will cover:
• What median tests are
• Why they are used
• When they are used
• How they are used

[Figure: histogram of % earning specified amount (0 to 30%) by income bracket, $0-$1000 through $6001-$7000]

What are median tests?
• They are tests similar to the mean tests covered in a college introduction to statistics.
• They include confidence intervals and significance tests.

When to use a median test (as opposed to a mean test):
• When the data or population do not fulfill the conditions for mean tests.
• The ONLY condition is a simple random sample!

Remember these conditions?
• 15 < n < 30 with slight skewness
• n > 30
• Or the population is normal
They are NOT necessary!

Why do we use median tests?
Because they are more robust!

Medians are more robust than means
SRS of salaries of Company A:

  #   Salary          #   Salary
  1   $18,000         8   $35,000
  2   $20,000         9   $36,000
  3   $23,000        10   $50,000
  4   $23,000        11   $50,000
  5   $23,000        12   $60,000
  6   $28,000        13   $130,000
  7   $30,000        14   $1,000,000

• The mean of these salaries is $109,000
• The median of these salaries is midway between #7 and #8, or $32,500
Just from looking at the list of salaries, the median describes the middle of the distribution much more accurately, since salary #14 pulls the mean so far up.

More robustness
The rest of the median-test procedure is also more robust than procedures based on the t-distribution. This combination of a robust statistic and a robust procedure allows for statistical inference on very skewed data.
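The robustness claim for the salary sample can be checked directly. The following is a minimal sketch using only Python's standard library; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median

# Salary sample for Company A, as listed in the slides.
salaries = [18_000, 20_000, 23_000, 23_000, 23_000, 28_000, 30_000,
            35_000, 36_000, 50_000, 50_000, 60_000, 130_000, 1_000_000]

full_mean = mean(salaries)      # pulled far up by salary #14
full_median = median(salaries)  # average of salaries #7 and #8

# Illustration only: drop the $1,000,000 salary and recompute both statistics.
without_outlier = salaries[:-1]
trimmed_mean = mean(without_outlier)
trimmed_median = median(without_outlier)

print(f"mean:   {full_mean:>12,.0f} -> {trimmed_mean:>10,.0f} without the outlier")
print(f"median: {full_median:>12,.0f} -> {trimmed_median:>10,.0f} without the outlier")
```

Removing the single largest salary moves the mean from $109,000 to roughly $40,500, while the median only moves from $32,500 to $30,000.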
Confidence Intervals for Medians
The two main types:
• Exact: needs tables and/or computer software
• Approximate: simpler tables, appropriate for larger samples
We will concentrate on the approximations.

Introduction to the Confidence Intervals
It is necessary to understand "rank".
The rank of a value in a distribution is simply its numbered place in the list of ordered values.
Example: in the distribution of letters {a, b, c, d, e, f}, "b" has a rank of 2 from the left and a rank of 5 from the right.

Steps for Approximate Confidence Intervals
1. Order the distribution from smallest to largest values.
2. Find the median of the distribution.
3. Find the rank* of each limit, depending on the sample size, from a table like the one shown on the next slide.
4. Take the rank number and count in that many data points from each side of the ordered data.
* Note that these ranks are computed by complicated formulas, then put neatly into a table for users, and used like the ranks defined before.

Ranks for non-parametric 95% confidence intervals*

  Sample size   Rank      Sample size   Rank
       8          1           21          6
       9          2           22          6
      10          2           23          7
      11          2           24          7
      12          3           25          8
      13          3           26          8
      14          3           27          8
      15          4           28          9
      16          4           29          9
      17          5           30         10
      18          5           31         10
      19          5           32         10
      20          6           33         11

* Values taken from Siegel's Statistics and Data Analysis

Example: Using the same salary data from before, with sample size 14 and rank 3, proceed as follows:

   #   Salary        Rank from nearest end
   1   $18,000       1
   2   $20,000       2
   3   $23,000       3   <- This is the lower confidence limit of the interval
   4   $23,000
   5   $23,000
   6   $28,000
   7   $30,000
   8   $35,000
   9   $36,000
  10   $50,000
  11   $50,000
  12   $60,000       3   <- This is the upper confidence limit of the interval
  13   $130,000      2
  14   $1,000,000    1

So, the 95% confidence interval is ($23,000, $60,000).

Significance test for medians
Remember duality?
"What is not contained in the confidence interval is significant at the same alpha-level."
This property of confidence intervals can be used to test for significance.

Steps for Significance Test at alpha = .05
1. Create a confidence interval at this alpha level.
2. Check to see if the accepted population value is included in the interval.
3. Draw a conclusion:
   - If the value IS included, the sample is NOT significant
   - If the value is NOT included, the sample IS significant

Sample Significance Test
Assume that the commonly accepted median of salaries at Company A is $53,000, and that the sample shown before was drawn.
Test hypotheses:
• Ho: M = $53,000, or that the true median of salaries in Company A is $53,000.
• Ha: M ≠ $53,000, or that the true median of salaries in Company A is NOT $53,000.
Our previous 95% confidence interval was ($23,000, $60,000), so:
• The accepted median, $53,000, is within the interval,
• The outcome is not significant,
• We do not reject the accepted median.

Mean Tests vs. Median Tests
Consider a population of children, with a distribution of the number of toys each one has.
• True mean Mu of 7.3 toys per child
• True median M of 7 toys per child

2 SRS's from the Population of Children
[Figure: two histograms, one per sample, of # of children (0 to 9) against # of toys (0 to 20)]
Both look very similar. The only difference is that one bar has moved to become a far-out outlier.
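Returning to the salary example, the rank-based confidence interval and the duality-based significance test can be sketched in a few lines. The slides do not give the formula behind the rank table, so the rule in `ci_rank` is an assumption: it takes the largest rank r with P(X ≤ r − 1) ≤ alpha/2 for X ~ Binomial(n, 1/2), which happens to reproduce every entry of the table for n = 8 to 33. The function names, and the choice to treat the interval as closed at both ends, are also illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def ci_rank(n, alpha=0.05):
    """Rank to count in from each end of the ordered sample.

    ASSUMPTION (not stated in the slides): use the largest r such that
    P(X <= r - 1) <= alpha / 2, where X ~ Binomial(n, 1/2).
    """
    limit = Fraction(str(alpha)) / 2
    cumulative = 0  # running sum of C(n, k), i.e. 2**n * P(X <= k)
    rank = 0
    for k in range(n + 1):
        cumulative += comb(n, k)
        if Fraction(cumulative, 2 ** n) > limit:
            break
        rank = k + 1
    if rank == 0:
        raise ValueError("sample too small for this confidence level")
    return rank


def median_ci(data, alpha=0.05):
    """Steps 1 to 4: sort, look up the rank, count in from both ends."""
    ordered = sorted(data)
    r = ci_rank(len(ordered), alpha)
    return ordered[r - 1], ordered[-r]


def is_significant(data, hypothesized_median, alpha=0.05):
    """Duality: significant exactly when the value lies outside the interval.

    ASSUMPTION: the interval endpoints count as inside.
    """
    low, high = median_ci(data, alpha)
    return not (low <= hypothesized_median <= high)


salaries = [18_000, 20_000, 23_000, 23_000, 23_000, 28_000, 30_000,
            35_000, 36_000, 50_000, 50_000, 60_000, 130_000, 1_000_000]

interval = median_ci(salaries)
print(interval)                           # (23000, 60000)
print(is_significant(salaries, 53_000))   # False: do not reject Ho
```

Exact fractions are used instead of floats because some table entries sit very close to the 0.025 cutoff (for n = 32 the next rank up has tail probability of about 0.02505).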
Sample 1: 95% Mean Confidence Interval
[Figure: histogram of Sample 1, with no outlier]
(use calculator 1-var stats)
• Sample mean x-bar = 7.1 toys
• Sample standard deviation = 1.9877
• Sample size n = 28
• Standard error of x-bar = 1.9877/√28 = .3756
• Z-score z* = 1.95996
• CI: 7.1 +/- (1.95996 × .3756): (6.364, 7.836)

Sample 1: 95% Median Confidence Interval
[Figure: histogram of Sample 1, with no outlier]
• Sample median = 7 toys
• Sample size n = 28
• Rank (see table) = 9
• Lower confidence limit = 6
• Upper confidence limit = 7
• CI: (6, 7)

Sample 2: 95% Mean Confidence Interval
[Figure: histogram of Sample 2, with outlier]
(use calculator 1-var stats)
• Sample mean x-bar = 8.4 toys
• Sample standard deviation = 4.8722
• Sample size n = 28
• Standard error of x-bar = 4.8722/√28 = .9208
• Z-score z* = 1.95996
• CI: 8.4 +/- (1.95996 × .9208): (6.595, 10.205)

Sample 2: 95% Median Confidence Interval
[Figure: histogram of Sample 2, with outlier]
• Sample median = 7 toys
• Sample size n = 28
• Rank (see table) = 9
• Lower confidence limit = 6
• Upper confidence limit = 7
• CI: (6, 7)
These statistics match up EXACTLY with the median CI for the first sample. The outlier did not affect the outcome, demonstrating the test's robustness.

Comparison of different intervals
[Figure: histograms of Sample 1 and Sample 2 with both intervals marked]
• Sample 1: Median CI (6, 7); Mean CI (6.364, 7.836)
• Sample 2: Median CI (6, 7); Mean CI (6.595, 10.205)

Discussion of differences
• The outlier made the mean confidence interval much wider, making it less useful
• The median interval stayed the same and captured the true median very closely (7 lies within the interval from 6 to 7)

Conclusion
When data are skewed, a median test can be much more useful than a mean test in estimating the true parameter.
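The two mean intervals can be recomputed from the rounded summary statistics quoted in the slides. This is a minimal sketch: the helper name `mean_ci` is hypothetical, the raw toy counts are not available, and the slides' choice of a z critical value (rather than t) is followed as given.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci(xbar, s, n, conf=0.95):
    """z-based interval: xbar +/- z* * s / sqrt(n), as on the slides."""
    z_star = NormalDist().inv_cdf(0.5 + conf / 2)  # about 1.95996 for 95%
    margin = z_star * s / sqrt(n)
    return xbar - margin, xbar + margin


# Summary statistics as quoted (already rounded) in the slides.
sample1 = mean_ci(7.1, 1.9877, 28)   # no outlier
sample2 = mean_ci(8.4, 4.8722, 28)   # with outlier

width1 = sample1[1] - sample1[0]
width2 = sample2[1] - sample2[0]

print(f"Sample 1: ({sample1[0]:.3f}, {sample1[1]:.3f}), width {width1:.3f}")
print(f"Sample 2: ({sample2[0]:.3f}, {sample2[1]:.3f}), width {width2:.3f}")
print(f"width ratio: {width2 / width1:.2f}")
```

With these inputs, Sample 1 comes out at about (6.364, 7.836) and Sample 2 at about (6.595, 10.205). The single outlier makes the mean interval roughly 2.5 times wider, while the median interval (6, 7) is unchanged.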