Action Research Data Manipulation
INFO 515, Glenn Booker, Lecture #4

Inferential Statistics Introduction
- We often want to estimate some statistic for a large population (e.g. average age, percent customer satisfaction, average weight, etc.)
- A common way to do so is by taking one or more samples (such as surveys of people, or measuring a handful of nails from a production batch) and measuring that statistic for each sample

Inferential Statistics Introduction
- We distinguish between the characteristics of a sample versus the characteristics of the entire population
- If you survey 30 Drexel students and ask their ages, your average (the sample statistic) probably won't equal the average of all Drexel students (the population statistic)
- If you repeat that survey many times, you will get a range of sample statistics (the average ages from each sample)

Sampling Terms
- "Statistic" means a value derived from a sample
- "Parameter" means a value derived from the entire population (often not known)
- The "values" may include means, standard deviations, percentages, correlation coefficients, regression coefficients, etc.

Types of Distributions
- Sample distribution: the distribution of data in our sample for some variable of interest
- Population distribution: the distribution of the whole population for some variable of interest
- Sampling distribution: the theoretical distribution of an infinite number of samples; its mean is the population mean

Sampling Distribution of a Statistic
- It is a theoretical distribution comprising the outcomes of drawing infinitely repeated samples
- It is NOT the distribution of one sample for some variable (e.g. one set of scores)
- We don't actually draw an infinite number of samples! We draw one sample, and it gives us a point estimate for the whole population

Point Estimate vs Interval Estimate
- From a sample statistic (such as a mean or percentage), we try to estimate a population parameter that we don't (maybe even can't) know
- This is called a point estimate of that statistic
- But we know that samples can sometimes be unrepresentative through random error, so we find a confidence interval to make a more realistic interval estimate

Point Estimate vs Interval Estimate
- A point estimate is more likely to be reported to the public (e.g. average income in a township) because people tend to understand averages and individual numbers
- An interval estimate (such as a confidence interval) is useful in decision-making and is often used with the test of hypothesis (discussed shortly)

Sampling Distribution
- Could the point estimate ever be the true value? Sure! Nature is not usually perverse, but we seldom know the true value
- Do we expect to get the true value always? No! Erroneous estimates would theoretically occur as often as estimates which are right on target
- Example: the average age of Philly residents. A census would produce the true mean age; samples would produce varying estimates of that true mean age

Central Limit Theorem
- The Central Limit Theorem helps validate sample statistics
- Many (e.g. 30+) random samples from a given population will have means which are close to the true mean of the entire population
- The means of those samples will be normally distributed about the true mean of the population

Central Limit Theorem
- When many samples of sufficient size are randomly drawn from a population, the values of the sample means will tend to cluster about the true value of the population mean, and this clustering will be normally distributed
- The mean of a sampling distribution will be a very good estimate of the true population mean
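To make the sampling-distribution and Central Limit Theorem ideas concrete, here is a minimal Python sketch (not part of the original slides) that draws repeated random samples from an invented population and checks that the sample means cluster around the true population mean; the population values, seed, and sample counts are all illustrative assumptions.

```python
import random
import statistics

# Hypothetical population: 10,000 simulated ages (illustrative values only).
random.seed(515)
population = [random.gauss(mu=27, sigma=6) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Draw many samples of size 30; each sample mean is one point estimate.
sample_means = [
    statistics.mean(random.sample(population, 30))
    for _ in range(1_000)
]

# The collection of sample means approximates the sampling distribution:
# its mean sits near the true population mean, and its values are roughly normal.
print(f"True population mean:        {true_mean:.2f}")
print(f"Mean of 1,000 sample means:  {statistics.mean(sample_means):.2f}")
print(f"Spread of the sample means:  {statistics.stdev(sample_means):.2f}")
```

The spread printed on the last line is an empirical version of what the later slides call the standard error of the mean.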
Hypothesis Testing

Hypothesis Testing
- There are Research and Null Hypotheses
- The Research Hypothesis, a.k.a. just the Hypothesis, is the relationship you think exists, or the statement you wish to prove
- "Computer aided instruction will affect math scores compared to standard instruction"
- "Library patrons trust electronic sources more or less than paper sources"

Null Hypothesis
- The Null (statistical) Hypothesis is the opposite of the Research Hypothesis; it is generally the assumption that no significant differences exist
- "There is no difference between computer aided instruction and traditional teaching in their effect on math scores"
- "Library patrons trust electronic and paper sources equally"

Hypothesis Testing is Weird
- In statistics, you can never prove anything; you can only try to reject the null hypothesis
- Because statistics are based on probability, which implies uncertainty, we can never absolutely disprove the null hypothesis
- We can only provide evidence that differences are not due to chance

Risks
- A risk in rejecting, or failing to reject, the Null Hypothesis is that we could be making one of two types of errors:
- Type I Error: we reject the null when it is actually true, saying something is significantly different when it is not
- Type II Error: we accept the null when it is really false, saying something is not significantly different when it is
- Honest, this is what they're called. I'm not making it up!

Types of Error (from Norusis, Guide to Data Analysis, p. 256)
Your result                     | Null hypothesis really is True | Null hypothesis really is False
You reject the null hypothesis  | Type I error                   | You are correct
You do not reject the null      | You are correct                | Type II error

Choosing a Significance Level
- We select a significance level (Sig.) as a criterion for rejecting a null hypothesis
- A significance level of 0.05 or less is generally suitable for the social sciences and corresponds to a z value of 1.96
- A significance level of 0.01 or less is used in clinical trials, drug studies, etc., and corresponds to a z value of 2.57

Significance versus Confidence
- The significance level is the complement of the confidence level (level of confidence): significance level + confidence level = 1
- A significance level of 0.05 corresponds to a confidence level of 95%
- A significance level of 0.01 corresponds to a confidence level of 99%

Sampling Error
- Sampling errors are deviations of a sample statistic from the true population value, produced by chance
- If deviations aren't caused by chance, they are called systematic errors, or bias
- Bias is introduced through failure to randomize fully, prejudices of interviewers, lies, evasions, memory lapses, defective measuring instruments, and many other causes

Standard Errors
- Standard Error is "A measure of how much the value of a test statistic may vary from sample to sample. It is the standard deviation of the sampling distribution for a statistic. For example, the standard error of the mean is the standard deviation of the sample means." (SPSS)

Standard Errors
- How do you know if the sample you picked is weirder than normal?
- Standard errors are estimates of sampling error

Standard Error of the Mean
- If we are trying to estimate the population mean from a sample, the standard error of the mean is computed thus: SEx̄ = s/√n
- SEx̄ (the bar is not a typo) means "Standard Error of x̄", the sample mean; s is the standard deviation of your sample, and n is your sample size
- It can also be written with the variance, rather than the standard deviation: SEx̄ = √(s²/n) [s² is the variance]

Standard Error of the Mean
- As n increases in the formula s/√n, the standard error gets smaller
- The larger we make our sample, the smaller our margin of error in estimating the true value
- If a sample n grew to equal the whole population N, we'd be gathering data from the whole population, and so make no error of estimate at all (SEx̄ → 0 as n → ∞)
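A minimal Python sketch of the standard-error formula above, SEx̄ = s/√n. Only the formula comes from the slides; the sample values here are invented for illustration.

```python
import math
import statistics

# Hypothetical sample (illustrative values only).
sample = [22, 25, 19, 31, 27, 24, 20, 28, 26, 23]

n = len(sample)
s = statistics.stdev(sample)      # sample standard deviation
se_mean = s / math.sqrt(n)        # standard error of the mean = s / sqrt(n)

# Equivalent form using the variance: sqrt(s^2 / n)
se_mean_alt = math.sqrt(statistics.variance(sample) / n)

print(f"n = {n}, s = {s:.3f}")
print(f"Standard error of the mean: {se_mean:.3f} (same as {se_mean_alt:.3f})")
```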
Confidence Interval
- A confidence interval (CI) is an estimated range of values with a given probability of including the true population value
- These are limits within which we would expect repeated sample means to fall
- The usual probability level is 95%, but we can use 99%
- At 95% confidence, we are saying that if we took repeated samples at random and constructed confidence intervals around the sample means, the true population mean would be captured in 95 out of 100 of those intervals

Confidence Interval Example
- The average blah for high school teachers 10 years ago was 30
- I don't believe this is true any more
- Take a random sample to measure blah and infer to the population

CI Example
- Given a sample with n = 25, mean = 11, s = 20
- Hypothesis: Is this sample different from a population with a mean of 30?
- Null hypothesis: There is no difference between the population mean and our sample mean (i.e. the average blah could still be 30)

CI Example
- Calculate the standard error of the mean: SEx̄ = s/√n = 20/√25 = 20/5 = 4
- Calculate the 95% confidence interval: CI = mean statistic ± (critical value)·(standard error of the mean statistic)
- CI95 = 11 ± (1.96)(4)
- CI95 = (3.16, 18.84)

CI Example
- We can find the exact location of our sample mean with this formula for z: z = (sample mean − population mean) / (sample standard error)
- z = (11 − 30)/4 = −4.75
- This z is further from zero than −1.96 (the critical z value for the .05 level of significance), so reject the null hypothesis

CI Example
- Conclusion: We can reject the null hypothesis, and state that it is unlikely we would get a sample mean as low as 11 if the sample truly came from a population with a mean of 30
- There is a statistically significant difference between the population mean of 30 and our sample mean
- At the 5% significance level, there is a five percent chance that we are wrong to reject the null hypothesis
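The same CI-example arithmetic, restated as a short Python sketch using the numbers from the slides (n = 25, sample mean = 11, s = 20, hypothesized population mean = 30, critical z = 1.96).

```python
import math

n, sample_mean, s = 25, 11, 20
pop_mean = 30        # hypothesized population mean
z_crit = 1.96        # critical z for 95% confidence (0.05 significance)

se = s / math.sqrt(n)                    # 20 / 5 = 4
ci_low = sample_mean - z_crit * se       # 11 - 7.84 = 3.16
ci_high = sample_mean + z_crit * se      # 11 + 7.84 = 18.84
z = (sample_mean - pop_mean) / se        # (11 - 30) / 4 = -4.75

print(f"SE of the mean: {se}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
if abs(z) > z_crit:
    print(f"z = {z:.2f}: reject the null hypothesis")
else:
    print(f"z = {z:.2f}: do not reject the null hypothesis")
```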
Hypothesis Testing to Compare a Sample to the Population Mean
1. State the hypotheses (Action Research p. 22/23): the research hypothesis and the null hypothesis
2. Choose a confidence or significance level
3. Take a sample from the population
4. Find the mean (x̄) and standard deviation (s) of that sample
5. Find the standard error of the mean: SEx̄ = s/√n

Hypothesis Testing
6. For n > 30, use z; for n <= 30, use t, where either z or t = (x̄ − m)/SEx̄ [m is the population mean]
7. Compare z or t to the critical value, based on the significance level (e.g. z = 1.96); for t, find the critical value with df = n − 1
8. If the actual z or t is further from zero than the critical value, reject the null hypothesis

Evaluating Critical z or t Value
- Applies to 2-tail testing
- [Diagram: a number line centered at 0, with the −critical value and +critical value of z or t marked; the region between them is "Accept Null Hypothesis", the regions beyond either critical value are "Reject Null Hypothesis", and the actual z or t value (X) is plotted against these limits]

Student's T
- With fairly large samples (25 and up) we can use z values and the two critical values
- When n is less than 25 (and some say less than 30) we should use the t distribution instead of z values
- The critical z value only depends on the confidence level
- The critical t value depends on the confidence level and df

Student's T
- Good news: t is calculated the same as z (same formula)
- BUT the critical value of t varies as a function of the degrees of freedom (df); degrees of freedom df = n − 1
- Essentially, the z distribution assumes df = ∞ (or at least that df is very large), so the Student's t test corrects for small sample sizes

T Example
- Set the significance level at .05
- Last year's average DVD price: $20.00
- A random sample of n = 16 titles had a mean of $24.00 and a standard deviation of $3.00; is this different?
- Calculate the standard error: SEx̄ = s/√n = 3/√16 = 3/4 = 0.75
- Calculate t = (sample mean − population mean)/standard error: t = (24 − 20)/0.75 = 5.33

T Example
- Calculate df = 16 − 1 = 15 degrees of freedom
- Go find the critical t value (Action Research handout p. 40/41)
- The critical value here is 2.131 (df = 15, two-tail significance of .05)
- Our actual t value of 5.33 is greater than our critical value, so reject the null hypothesis (the sample cost is different than last year's average)
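The DVD-price t example above as a Python sketch, using the slide's numbers (n = 16, sample mean $24, s = $3, last year's mean $20) and the critical t of 2.131 read from a t table at df = 15.

```python
import math

n, sample_mean, s = 16, 24.0, 3.0
pop_mean = 20.0        # last year's average DVD price
t_crit = 2.131         # two-tail critical t, df = 15, .05 significance (from a t table)

se = s / math.sqrt(n)                    # 3 / 4 = 0.75
t = (sample_mean - pop_mean) / se        # (24 - 20) / 0.75 = 5.33
df = n - 1

print(f"df = {df}, SE = {se}, t = {t:.2f}")
if abs(t) > t_crit:
    print("Reject the null hypothesis: the sample price differs from last year's average.")
else:
    print("Do not reject the null hypothesis.")
```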
1-tail versus 2-tail?
- A one-tail test is used when a specific direction of difference is to be tested ("the sample DVD price is not greater than last year's")
- A two-tail test is used when no particular direction of difference is to be tested ("the sample DVD price is not different than last year's", i.e. it could be greater or less than)
- Use the two-tail test most of the time to find the critical t value

Plotting Descriptive Statistics

More Descriptive Statistics
- Now we look in more detail at how one statistic may vary
- Boxplots and stem-and-leaf diagrams help tell more about the distribution of a variable, especially for cases where the distribution isn't very normal

Boxplots
- Boxplots, also known as box-and-whisker displays, help show odd, lopsided distributions of data
- They show the median (not mean) value inside a box; the box covers the 25th through 75th percentiles, and its top and bottom edges are called "hinges"

Boxplots
- They also show the largest and smallest values which are not outliers (these are the whiskers on the box)
- And they show individual values which are more than 1.5 box-lengths ("Outliers") or more than 3.0 box-lengths ("Extremes") from the 25th and 75th percentiles (these get separate symbols: "O" for outliers and "*" for extremes)

Boxplot Example from MedLibs
- [Boxplot of z-scores for five MedLibs variables (annual circulation, annual in-house use, student population, annual expenditures, staff size; N = 26 each), with the median, 25th and 75th percentiles, largest and smallest non-outlying values, outliers (>1.5 box-lengths), and extremes (>3 box-lengths) labeled]

Boxplot of Exam Scores
- [Boxplot of EXAM scores (N = 40), with scores ranging from about 30 to 110 and two low outlying cases (cases 10 and 33) flagged]

Generating Boxplots
- Use the menu Graphs / Boxplot…
- Select a Simple example for "Summaries of separate variables", then click Define
- Select the variables to be plotted (move them into Boxes Represent)
- Click "OK"

Stem-and-Leaf Diagram
- The stem-and-leaf diagram is a sideways histogram which provides additional information about specific data values, and flags extreme values
- This only works easily for up to a few dozen data points, maybe a couple hundred
- Imagine taking your data and rounding it off to two significant digits: 3.567 becomes 3.6; 1.213 becomes 1.2; etc.

Stem-and-Leaf Diagram
- Then move the decimal point in between the two significant digits, remembering how far you had to move it
- The first significant digit is called the stem
- The second significant digit of each data point is the leaf, and the leaves are grouped together when they share a stem

Stem-and-Leaf Diagram
- For example, for ages ranging from the 60's to the 80's, make the first digit the stem and the second digit the leaf (86 becomes a stem of 8 and a leaf of 6)
- Then group the stems and provide a single row of numbers to list every data value with that stem, so the data 82, 85, 89 become "8 . 259" (stem 8, and leaves 2, 5, and 9)

Stem-and-Leaf Diagram
- To show that a stem of '8' means '80', define the "stem width" as a multiplier for the stem to get its true value
- So if the stem-and-leaf diagram shows a value of 8.2 with a stem width of 10, then the actual data value was 8.2*10 = 82
- Extreme values are reported separately as "Extremes" with their approximate range

EXAMPLE Stem-and-Leaf Plot

 Frequency   Stem &  Leaf
     2.00   Extremes  (=<46)
     1.00    5 .  2
      .00    5 .
     3.00    6 .  013
     3.00    6 .  678
     4.00    7 .  0004
    11.00    7 .  56677778899
     8.00    8 .  11112234
     4.00    8 .  6778
     3.00    9 .  224
     1.00    9 .  8

 Stem width:  10
 Each leaf:   1 case(s)

(The row "8 . 11112234" represents eight data points: four are 81, two are 82, and one each is 83 and 84.)

Stem-and-Leaf Diagram
- If a very large number of data points are close together, additional codes can be used in place of the period to break up the bars:
- "*" means leaves of 0 or 1
- "t" means 2 or 3
- "f" means 4 or 5
- "s" means 6 or 7
- "." (period) means 8 or 9

Stem-and-Leaf Diagram
- So the stem and leaf of:
  1 t 23
  1 f 445
  1 . 899
- would mean data values of (for width = 10) 12, 13, 14, 14, 15, 18, 19, and 19, instead of stuffing them all on one row as "1 . 23445899"
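Outside of SPSS, the stem-and-leaf construction described above (first significant digit as the stem, second as the leaf, stem width of 10) takes only a few lines of Python; this is a sketch on invented ages, not SPSS's own procedure (it does not split stems or set aside extremes).

```python
from collections import defaultdict

# Hypothetical ages (illustrative data only); stem width = 10.
ages = [62, 65, 68, 71, 74, 74, 75, 78, 79, 79, 82, 85, 89, 91]

leaves = defaultdict(list)
for value in sorted(ages):
    stem, leaf = divmod(value, 10)   # e.g. 86 -> stem 8, leaf 6
    leaves[stem].append(str(leaf))

print("Stem width: 10   Each leaf: 1 case")
for stem in sorted(leaves):
    print(f"{stem} . {''.join(leaves[stem])}")
```

For these values it prints rows such as "8 . 259", matching the convention on the slides.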
Frequency Polygon
- The polygon is constructed much the same as a histogram, including the "¾ high rule," except a dot (instead of a bar) is placed at the frequency with which each value occurs, and a line is drawn between the points
- You can see this polygon (next slide) shows the actual value of the score, its frequency, and the shape of the distribution, but the intervals between data points are unequal
- If you look at the horizontal (X) axis, you can see, for example, that the numbers go from 101 to 110, and then from 120 to 123 (unequal intervals!)

Frequency Polygon for Class I.Q.
- [Frequency polygon of Class I.Q.; the X axis lists the observed I.Q. values (101, 110, 120, 123, 125, 126, 128, 130, 132, 135, 140) at unequal intervals, and the Y axis shows Frequency from 0 to 4.5]

Histogram vs Frequency Polygon
- How do you decide whether to use a histogram or a frequency polygon?
- A Histogram is used for discrete data (counting units, like the bookmobile stops)
- The Frequency Polygon is used for continuous data (measurement, like the length or I.Q. distribution)
- This is a case where the type of tool used depends on the type of data: discrete (ordinal or nominal) or continuous (ratio or interval)
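As a closing aside (not part of the lecture, which builds its charts in SPSS or by hand), a frequency polygon like the one above can be sketched in Python with matplotlib; the scores and their frequencies here are invented, with value spacing similar to the Class I.Q. example.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical I.Q. scores (illustrative only).
scores = [101, 110, 120, 120, 123, 125, 125, 125, 126, 128,
          128, 130, 130, 130, 130, 132, 135, 135, 140]

counts = Counter(scores)
xs = sorted(counts)              # actual score values (note the unequal spacing)
ys = [counts[x] for x in xs]     # frequency of each value

# A dot at each (value, frequency), joined by straight line segments.
plt.plot(xs, ys, marker="o")
plt.xlabel("Class I.Q.")
plt.ylabel("Frequency")
plt.title("Frequency Polygon of Class I.Q.")
plt.show()
```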