Lecture 2
Measures of Central Tendency and Variability

Pros and Cons of Tables and Graphs

Pros
1. Easy for laypeople to understand.
2. Many are fairly easy to construct.
3. Show the complexities of distributions and comparisons of distributions – central tendency, variability, shape, outliers all in one presentation.
4. Particularly good for identifying problem distributions and outliers.
5. Don’t require a specific distribution shape, such as normality.

Cons (relative to numeric summaries)
1. Take up space.
2. Are not amenable to further computations – no analog to a mean of means, for example.
3. Richness of information may get in the way when dealing with data with identical distributions.
4. Not useful for generalizing from samples to populations.

Numeric Summaries
Single values chosen to represent a characteristic of data.
Measures of central tendency – Single values chosen to represent central tendency of a collection.
Measures of variability – Single values chosen to represent variability of a collection.
Measures of skewness – Ditto for skewness of a distribution.
Measures of kurtosis – Ditto for how similar the distribution is to the normal distribution.
Looking ahead: Measures of correlation – The extent to which values of one variable covary with paired values of another variable.

Copyright © 2005 by Michael Biderman Measures of CT and Variability - 1 6/27/2017

Missing Data
Why consider missing data here? Because the presence of missing data complicates the computation and representation of data using the numeric summaries we’re about to cover.

Reasons for missing data include
1) respondents failing to answer questions in a survey.
2) values incorrectly entered into the computer.
3) values that represent “Don’t Know” or “Don’t Care” or “Won’t tell you” responses.

In SPSS parlance, a missing value is a data value that was entered in order to represent the fact that a score is, in fact, missing.
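To see why such values complicate the numeric summaries, here is a minimal sketch in Python (used only for illustration; it is not SPSS syntax). The code 9 standing for “Don’t Know” on a 1-to-5 scale, the sample responses, and the function name mean_excluding_missing are all assumptions made for this example.

```python
from statistics import mean

# Hypothetical responses on a 1-5 scale; 9 is an assumed code for "Don't Know".
MISSING_CODES = {9}
responses = [4, 5, 9, 3, 4, 9, 2]


def mean_excluding_missing(values, missing_codes):
    """Average only the values that are not declared as missing codes."""
    valid = [v for v in values if v not in missing_codes]
    if not valid:
        raise ValueError("no valid (non-missing) values to average")
    return mean(valid)


# Treats the 9s as if they were real scores.
naive = mean(responses)
# Leaves the declared missing codes out of the computation.
correct = mean_excluding_missing(responses, MISSING_CODES)

print(f"mean with missing codes counted:  {naive:.2f}")
print(f"mean with missing codes excluded: {correct:.2f}")
```

With the codes counted, the mean lands above 5, a value that is impossible on a 1-to-5 scale. That is the kind of distortion that declaring the code as missing is meant to prevent.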
In SPSS, an empty cell in the data editor stands for a missing value. But in many situations, an actual value must be recorded when there is a missing response. Such values are the missing values we’re dealing with here.

Missing values are not a terribly important issue when frequency distributions and graphs are used to summarize data because they’re just part of the summary. But when a statistic is to be computed, values that “don’t count” should not be included in the computation. The statistical package has to be told that such values are special and are not to be included in computation of statistics.

Missing data are represented in SPSS in two ways.
1) Empty cells in the Data Editor window. These are called SYSTEM MISSING.
2) Actual values entered into the Data Editor window but given “Missing Value” status by you.

To tell SPSS that one of the values of a variable is to be treated as a “Missing Value”,
1) Click on the Variables tab at the lower left of the Data Editor window.
2) Click under “Missing” in the same row as the variable for which missing values are to be declared.
3) Enter the values to be treated as missing in the dialog box shown below.

Measures of central tendency
From worst to best

The Mode: 5 6 7 7 7 7 8 9 10 11 13
Definition: Value that occurred most frequently in the collection.
Problems
1) Often not computable, especially with small samples. E.g., what’s the mode of 3, 4, 5, 5, 6, 7, 8, 8, 9?
2) Very unstable (unreliable) from sample to sample.
Should only be reported . . .
1) When it dominates the data, e.g., 70% of scores are one value.
2) When data are nominal, e.g., gender, ethnic group, in which case other quantitative measures are not appropriate.
Don’t report it (on penalty of lost points) in other situations.

The median
Conceptual definition: Value above which and below which 50% of scores fall. How about: 2 4 6 8?
Operational definition:
1) Order the scores.
2) For odd N, the median is the middle score in the ordered list. For even N, the median is the average of the two middle scores in the ordered list.

Example 1 – N odd (e.g., 6 7 7 7 7 9 9: Median = 7)
X’s: 81, 69, 77, 93, 96, 99, 83, 85, 75, 89, 94
Ordered: 69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 99.
Median is 85.

Example 2 – N even (e.g., 6 7 7 7 7 8: Median = 7)
X’s: 81, 69, 77, 93, 96, 99, 83, 85, 75, 89, 94, 57
Ordered: 57, 69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 99.
Median is (83+85)/2 = 84.

Pros
1. Gives an indication of the center of the distribution.
2. Usually not affected by outliers. E.g., the median of 69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 999 is 85. So the 999 didn’t affect it. Robust with respect to outliers.
3. All in all, a very good measure.
Cons
1. For normally distributed data for which there are absolutely no outliers, the median is slightly less stable from sample to sample than the mean.
2. Not a part of the normal distribution formula. Not descended from royalty.

The mean
Definition: Arithmetic average of the scores. Weighted sum of the scores with weighting equal to 1/N.
(Side graphic ranking the measures: Best – Mean; Median; Worst – Mode.)

Symbols
Group        Symbol
Sample       $\bar{X}$ or $M_X$
Population   µ (Pronounced myou. If you mated a cat that says “meow” and a cow that says “moo”, the offspring would say “mu”.)

Pros
1. Good heritage – comes from royalty. It’s a part of the normal distribution formula.
2. For normally distributed data with no outliers, most stable from sample to sample.
3. Computation is straightforward, doesn’t involve sorting.
Cons
1. Can be dramatically affected by outliers. For example, the mean of 69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 99 from above is 941/11 = 85.5. But the mean of 69, 75, 77, 81, 83, 85, 89, 93, 94, 96, 999 is 1841/11 = 167.4, a value not close to ANY of the original scores. Compare this with the median of the above data. You should always compute both and compare them.
2. Related to the above, many analysts feel that the mean is unrepresentative of skewed data. So compute the median AND the mean. If they’re approximately equal, then use the mean. If they’re different, then the median is probably more appropriate.

Trimmed mean
Definition: Mean of the scores remaining after the largest K% and smallest K% have been removed. Typically, K is 5. Having your cake and eating it too. Olympic tradition.
Pros
1. Less affected by outliers.
Cons
1. Still not representative of skewed data, in my view.

When to use the various measures of Central Tendency
Memorize this table. Make a locket out of it.

I. Numeric Variables

Distribution Shape             No Outliers   Outliers may be present
Unimodal and Symmetric (US)    Mean          Median, Trimmed Mean
Skewed                         Median        Median

II. Nominal Data. The mode is the only measure that makes sense when you're attempting to summarize nominal data.

Measures of Variability

The Range
Definition: Difference between largest score and smallest.
2 problems.
1. Range is restricted whenever score values are restricted. Use of 5-point scales on questionnaires is a good example.
2. Range is unstable from sample to sample.
Don’t use as the primary measure.

The Interquartile Range
Quartiles: Points identifying "quarters" of a distribution.

Conceptual Definitions
Q4 Fourth Quartile The value below which 4/4ths of the distribution falls.
Q3 Third Quartile The value below which 3/4ths of the distribution falls.
Q2 Second Quartile The value below which 2/4ths of the distribution falls.
Q1 First Quartile The value below which 1/4th of the distribution falls.
Q0 "Zeroth" Quartile The value below which 0/4ths of the distribution falls.

Operational Definitions
Q4 The largest score in the distribution.
Q3 The median of the upper half of the distribution. (If N is odd, include the overall median in the upper half.)
Q2 The overall median of the collection. Compute using the median formula.
Q1 The median of the lower half of the distribution. (If N is odd, include the overall median in the lower half.)
Q0 The smallest score in the distribution.

Interquartile Range: The distance (on the number line) between Q1 and Q3 - between the first quartile and the third quartile.
IQR = Q3 - Q1

Interpretation
The distance or interval size required to contain the middle 50% of the scores.
If the middle 50% is contained in a small area, the distribution is quite "crowded" - the scores are close to each other; the distribution has little variability.
If the middle 50% is contained in a wide area, the distribution is sparse - the scores are far from each other; the distribution has much variability.

Example - A distribution with an even number of scores.
Scores: 75 65 50 45 40 40 35 35 30 30 30 25 25 10
Upper half of distribution: 75 65 50 45 40 40 35 (Q3 = 45)
Lower half of distribution: 35 30 30 30 25 25 10 (Q1 = 30)
IQR = 45 – 30 = 15.

Example - A distribution with an odd number of scores.
Note that 35, the overall median, is included in both the lower and upper halves.
Scores: 65 50 45 40 35 35 30 25 25 20 15
Upper half of distribution: 65 50 45 40 35 35 (Q3 = 42.5)
Lower half of distribution: 35 30 25 25 20 15 (Q1 = 25)
IQR = 42.5 – 25 = 17.5

Conscientiousness scale scores from the Bias Study Questionnaire Packet administered at the beginning of semester in 2008. Each person’s score was the mean of either 10 items (IPIP) or 12 items (NEO). For each, the response scale was a 5-point scale, numbered from 1 to 5.

Distribution of Conscientiousness scores from the IPIP Personality Questionnaire.
Interquartile range = 4.00 – 3.30 = 0.70
FYI: Mean = 3.60 SD = 0.61

Distribution of Conscientiousness scores from the NEO-FFI Personality Questionnaire
Interquartile range = 3.67 – 3.33 = 0.34
FYI: Mean = 3.48 SD = 0.27

Both the IPIP questionnaire at the top and the NEO questionnaire at the bottom were scored on the same 5-point scale. The major difference between the two is in the variability of responses. Variability of scale scores was greater for the IPIP questionnaire than for the NEO.

Variance
Definition 1: The sum of the squared differences of the scores from the mean divided by N. This is the “dividing by N” definition. Use this for populations.
Definition 2: The sum of the squared differences of the scores from the mean divided by N-1. This is called the, you guessed it, “dividing by N-1” definition. Use this for samples.

The variance is a useful theoretical measure of variability, but it’s not useful as a descriptive measure because it’s in squared units.
Variance is part of the normal distribution formula, so it has good roots.
Variance is a part of many formulas (e.g., t, F) in inferential statistics.

Standard Deviation
Definition 1: Square root of the sum of the squared differences of the scores from the mean divided by N.
Definition 2: Square root of the sum of the squared differences of the scores from the mean divided by N-1.
Wait! Is this déjà vu all over again? Do these seem familiar? The standard deviation is simply the square root of the variance.

Symbols
Group        Measure              Symbol       Formula
Sample       Variance             $S^2$        $\sum (X - \text{Mean})^2 / (N - 1)$
Sample       Standard Deviation   $S$          $\sqrt{\sum (X - \text{Mean})^2 / (N - 1)}$
Population   Variance             $\sigma^2$   $\sum (X - \text{Mean})^2 / N$
Population   Standard Deviation   $\sigma$     $\sqrt{\sum (X - \text{Mean})^2 / N}$

Pros
1. Good roots – is in the normal distribution formula.
2. Generally regarded as best for normal distributions (with no outliers).
Cons
1. Inflated by the presence of outliers. Can be dramatically inflated by them.
2. What’s it mean??

Facts about the Standard deviation
Assume you have a large (e.g., N >= 30) collection of scores that are unimodal and symmetric.
1. About 2/3 of the scores will be within 1 SD of the mean, that is, between Mean - SD and Mean + SD.
2. About 95% of the scores will be within 2 SDs of the mean, that is, between Mean - 2 SD and Mean + 2 SD.

So, if you scored 2 standard deviations above the mean in Conscientiousness, what would be your approximate percentile rank?
2 SDs above the mean would be 3.6 + .61 + .61 = 4.82. Two SDs below is 3.6 – 1.22 = 2.38.

Wrap up.
                      No outliers          Outliers possible
US distribution       Standard deviation   IQR
Skewed distribution   IQR                  IQR

Making use of both scale level and scale variability
Data: IPIP Conscientiousness Scale. (Excerpt from Data Editor.)
Compare lines 1 and 8 – both have the same scale level (4.00) but 8 is much more variable than 1.
Compare lines 17 and 20 – both have the same variability (1.07) but 20 has a higher scale value than 17.

Distributions of level and variability . . .
(Figure annotation: Predicted to have the highest GPA.)
Note that both distributions are approximately unimodal and symmetric, although the distribution of standard deviations is slightly positively skewed.

We have found that conscientiousness (gencon in the above graph) is a valid predictor of GPA. We’ve found it, as have probably more than 100 other researchers. It’s not a perfect predictor, but it has been found to be statistically significant in a vast majority of studies.
People who score high on conscientiousness scales generally get better grades than people with the same intelligence who score lower on conscientiousness.

Now here’s something that is almost new to our research here at UTC: We have found that variability in self-report (sgencon in the above) is ALSO a valid predictor of GPA. Only about 5 studies have found that – all of them conducted here at UTC. The relationship is inverse. People who are more inconsistent in their self-reports (who have higher sgencon values) have slightly LOWER GPAs than people who are less inconsistent.

Measures of distribution shape

Measures of skewness
A popular measure of skewness is the following, given by Kirk, R. (1999). Statistics: An introduction. 4th Ed. New York: Harcourt Brace.

$\text{Skewness} = \dfrac{\sum (X - \text{Mean})^3 / N}{S^3}$

In English: The sum of the cubed deviations of scores from the mean divided by N, then divided by the cube of the standard deviation.

Interpretation of values
Value of Skewness measure   Interpretation
Larger than 0               Positively skewed distribution
0                           Symmetric distribution
Less than 0                 Negatively skewed distribution

Example of the skewness statistic
1. Salaries from the Employee Data file.
   Statistics (salary, Current Salary): N Valid = 474, Missing = 0, Skewness = 2.125, Std. Error of Skewness = .112
   (Histogram of Current Salary: Mean = $34,419.57, Std. Dev. = $17,075.661, N = 474)
2. Extroversion scores of 109 UTC students.
   Statistics (hext): N Valid = 109, Missing = 1, Skewness = -.220, Std. Error of Skewness = .231
   (Histogram of hext: Mean = 4.4582, Std. Dev. = 0.95104, N = 109)

Kurtosis
Kurtosis refers to the relationship of the shape of a distribution to the shape of the Normal Distribution. Kirk gives the following measure of Kurtosis.

$\text{Kurtosis} = \dfrac{\sum (X - \text{Mean})^4 / N}{S^4} - 3$

In English: The sum of the deviations of scores from the mean raised to the fourth power divided by N, then divided by the standard deviation raised to the fourth power, minus 3.

Interpretation
Value of Kurtosis measure   Interpretation
Larger than 0               More peaked than the Normal distribution.
0                           Same peakedness as the Normal distribution.
Less than 0                 Less peaked (flatter) than the Normal distribution.

Example
1. Extroversion scores of 109 UTC students.
   Statistics (hext): N Valid = 109, Missing = 1, Kurtosis = -.371, Std. Error of Kurtosis = .459
   (Histogram of hext: Mean = 4.4582, Std. Dev. = 0.95104, N = 109)
Although it’s not immediately apparent from the histogram, according to the Kurtosis measure the distribution is slightly less peaked than the Normal Distribution.

Importing Data from Excel
1. Importing Data from Excel using SPSS’s built-in Importing capabilities.
Demo with ‘G:\MDBT\InClassDatasets\TennesseeHospitalSurvey for class pres.xls’
A. From SPSS: File -> Open -> Data (Choose “Excel (*.xls)” under “Files of type:”.)
Check all data very carefully. Sometimes the data won’t be put into SPSS in the way you believe they should.
Problem areas . . .
i. Date and Time variables.
ii. Columns of numbers which happen to have a blank cell or a string character in the first cell of the column.
Make the appropriate choice in the following dialog box.
If the Excel file has names in the first row, leave the “Read variable names from the first row of data” box checked. If there are no variable names in the Excel file, uncheck that box.

The Excel file . . .
The SPSS file . . .

2. Importing data from Excel by copying and pasting.
A. Open a blank SPSS data editor window.
B. Open the file within Excel.
C. Highlight a column and choose “Copy”.
D. Click on the top cell of the column in which data are to be pasted in SPSS.
E. Choose “Paste”.
Check all data very carefully.
Problem areas . . .
i. If pasting a String (character variable), you must set the column type in SPSS as string before pasting.
ii. Columns which have mixtures of strings and numbers will paste in as only strings or only numbers in SPSS. SPSS doesn’t allow mixtures of data types within a column.
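
The mixed-column problem in item ii can be checked before pasting. The sketch below is plain Python rather than SPSS syntax, and it assumes the cells of each column have already been read into a list. The function names is_number and classify_column and the three sample columns are hypothetical, made up for illustration.

```python
def is_number(cell):
    """Return True if the cell is numeric or is text that parses as a number."""
    if isinstance(cell, bool):
        return False
    if isinstance(cell, (int, float)):
        return True
    try:
        float(str(cell).strip())
        return True
    except ValueError:
        return False


def classify_column(cells):
    """Label a column 'numeric', 'string', 'mixed', or 'empty', ignoring blank cells."""
    filled = [c for c in cells if c is not None and str(c).strip() != ""]
    if not filled:
        return "empty"
    numeric_count = sum(is_number(c) for c in filled)
    if numeric_count == len(filled):
        return "numeric"
    if numeric_count == 0:
        return "string"
    return "mixed"


# Hypothetical columns standing in for data copied out of a spreadsheet.
ages = [23, 31, "", 45]             # numbers with one blank cell
sites = ["North", "South", "East"]  # text only
scores = [3.5, "n/a", 4.0, "4.5"]   # numbers mixed with a text marker

for name, column in [("ages", ages), ("sites", sites), ("scores", scores)]:
    print(name, "->", classify_column(column))
```

A column reported as mixed is one to clean up in Excel first, for example by replacing the text marker with a numeric code that is then declared as a missing value in SPSS.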