Recitation, Week 3: Basic Descriptive Statistics and Measures of Central Tendency

1. What does Healey mean by “data reduction”?
   a. Data reduction involves using a few numbers to summarize the distribution of a variable, or an array of data as he calls it.
2. What is the problem with using only a few numbers to summarize the distribution of a variable?
   a. Summarizing a distribution involves using the mean, denoted X̄, or the standard deviation, denoted σ, to describe the variable. This inevitably leads to a loss of information (precision and detail).
3. When analyzing descriptive statistics, it is best to describe the data in terms of percentages as opposed to frequency counts. Comparisons are difficult to conceptualize as raw frequencies.
   a. EXAMPLE: Instead of saying 20 out of 100 students got 4. on the exam, say 20% of students got 4. on the exam.
4. What is the difference between a percentage and a proportion?
   a. A percentage is a proportion multiplied by 100.
5. What is a measure of central tendency?
   a. It is a way to summarize the distribution to give you an idea about the typical case of that distribution, in other words, the center of it.
   b. There are three measures of central tendency:
      i. The mean: describes the typical score.
      ii. The mode: describes the most frequently occurring score.
         1. It is the only measure of central tendency that can be used with nominal variables.
      iii. The median: is the 50th percentile of the distribution.
         1. A median is a special case of a percentile, which is the score below which a specific percentage of cases fall.
   c. How does the median differ from the mode and the mean? Unlike the mode or the mean, the median always represents the exact center of a distribution of scores, meaning that 50% of the cases always fall above the median and 50% of the cases always fall below the median.
   d. Characteristics of the mean
      i. The mean is always the center of any distribution in the sense that it is the point around which all of the scores cancel out.
         Mathematically, this says that if we subtract the mean from each value and sum the results, the resulting sum will be equal to 0.
      ii. The mean can be misleading because it is sensitive to all observations, including extreme ones. The median is much less sensitive to extreme observations, and therefore it is often “better” to report the median.
         1. To illustrate this, consider the familiar normal or “bell” curve. This is a symmetric distribution because there are as many values to the left of the center as there are to the right. Many natural phenomena, such as weight and height, have approximately normal distributions.
         2. There are important distributions that are not symmetric. When a distribution is not symmetric, it is skewed. There are two types of skewed distributions: right skewed and left skewed.
         3. EXAMPLE of RIGHT SKEWED: Income. Often it is better to report the median than the mean, since the mean is misleading when there are extreme cases.
            a. EXAMPLE. Consider the following summary of AGE. Notice that the arithmetic mean is somewhat greater than the median. The reason is that the distribution is right skewed. If the mean is larger than the median, the distribution is __________ skewed.

               Statistics: AGE OF RESPONDENT
               N Valid      1385
               N Missing       2
               Mean        44.94
               Median      41.00

               To see this, create a histogram of the age variable.

               [Histogram of AGE OF RESPONDENT: Std. Dev = 17.08, Mean = 44.9, N = 1385]

6. What is a “measure of dispersion”?
   a. Measures of central tendency do not tell us anything about how much the data values differ from each other.
      i. EXAMPLE: What is the mean of the following two distributions of AGE?
         1. 50 50 50 50 50
         2. 10 20 50 80 90
      ii. The distributions are obviously very different.
   b. Measures of dispersion or variability attempt to quantify the spread of observations.
   c. A measure of dispersion is usually defined in terms of variability around the mean.
   d.
      A deviation is the distance between an individual score and the mean value; mathematically, this is (X_i − X̄).
   e. The larger the distance from the mean, the larger the deviation will be.
   f. The more closely the scores are clustered around the mean, the less variability there will be.
      i. PRACTICAL EXAMPLE: Let’s assume that the average income for people with PhDs is $55,000 and the average income for people with a high school education is $20,000. Since people with only a HS education have fewer opportunities than people with PhDs, most of them would make somewhere around $20K, so there is not much variation. However, it is possible for PhDs to make anywhere from $20K to $800K per year, and hence there is much more variation around the average salary for PhDs than there is for HS graduates.
7. USING SPSS to Produce Measures of Dispersion
   a. Use the Descriptive Statistics menu to find the range and standard deviation for age, educ and tvhours.
   b. To reproduce this output, first open gss98randsamp.sav.
   c. Go to the familiar Analyze → Descriptive Statistics → Frequencies.
   d. Put the variables corresponding to age, educ and tvhours in the box labeled “Variable(s)”.
   e. Click the “Statistics” button and check mean, minimum, maximum and standard deviation.
   f. Click Continue, then OK.
   g. Notice that the chart in the book looks a little bit different, so let’s transpose the rows and columns to make it look like Healey’s.
   h. Double click on the table in the output window.
   i. Then from the menu select Pivot → Transpose Rows and Columns. You should get what is shown below.

      Statistics
                                         N Valid  N Missing   Mean  Std. Deviation  Minimum  Maximum
      AGE OF RESPONDENT                     1385          2  44.94          17.080       18       89
      HIGHEST YEAR OF SCHOOL COMPLETED      1381          6  13.37           2.857        0       20
      HOURS PER DAY WATCHING TV             1134        253   2.86           2.197        0       21

   j. How do we interpret these results? What is 1 standard deviation above the mean for the variable tvhours?
8. USING THE COMPUTE COMMAND to create an attitude toward abortion scale
   a.
      There are two distinct measures of attitudes toward abortion in the 1998 GSS survey. One variable, abany, asks respondents whether they believe that abortion should be allowed for any reason. The other, abhlth, measures whether they feel abortion should be allowed if the woman’s health is seriously endangered.
   b. We want to create a summary measure that gives us an overall measure of attitude toward abortion.
   c. We must know something about the data. If the response on abany is 1, then the person was in favor of abortion for any reason. If the value in the dataset is 2, then the person was opposed. The same coding applies to abhlth: 1 = in favor of abortion if health is at stake, 2 = not in favor.
   d. We want an overall measure of anti-abortion position, so we will sum the variables. If the person was in favor of both, then our new variable will have a value of 2 (1+1). In favor of one and not the other gives a value of 3 (2+1 or 1+2). If a person is completely against abortion, the value is going to be 4 (2+2).
   e. We need to use the Compute command. To open the Compute Variable dialog box, from the menus choose Transform → Compute.
   f. In the Compute Variable dialog box, type abscale, which will represent the variable we are creating.
   g. Click the Type and Label button and type “Abortion Scale”, then Continue.
   h. Select (or type) abany in the variable list and move it into the Numeric Expression box. Then type + and then abhlth, and click OK.
   i. Get the frequency distribution of each variable.

      Statistics
                                               N Valid  N Missing    Mean  Std. Deviation  Minimum  Maximum
      ABORTION IF WOMAN WANTS FOR ANY REASON       887        500    1.58            .494        1        2
      WOMANS HEALTH SERIOUSLY ENDANGERED           895        492    1.12            .321        1        2
      Abortion Scale                               855        532  2.6865          .67404     2.00     4.00

      ABORTION IF WOMAN WANTS FOR ANY REASON
                        Frequency  Percent  Valid Percent  Cumulative Percent
      Valid    YES            372     26.8           41.9                41.9
               NO             515     37.1           58.1               100.0
               Total          887     64.0          100.0
      Missing  NAP            449     32.4
               DK              49      3.5
               NA               2       .1
               Total          500     36.0
      Total                  1387    100.0

      WOMANS HEALTH SERIOUSLY ENDANGERED
                        Frequency  Percent  Valid Percent  Cumulative Percent
      Valid    YES            791     57.0           88.4                88.4
               NO             104      7.5           11.6               100.0
               Total          895     64.5          100.0
      Missing  NAP            449     32.4
               DK              42      3.0
               NA               1       .1
               Total          492     35.5
      Total                  1387    100.0

      Abortion Scale
                        Frequency  Percent  Valid Percent  Cumulative Percent
      Valid    2.00           370     26.7           43.3                43.3
               3.00           383     27.6           44.8                88.1
               4.00           102      7.4           11.9               100.0
               Total          855     61.6          100.0
      Missing  System         532     38.4
      Total                  1387    100.0

      Note: for category 3.00, we know these are the cases where the person approved in one situation but not in the other, but we do not know which situation they approved. It seems reasonable that they approved when the health of the woman was at stake but not for any reason, but we would have to use other procedures to find that out.

SPSS companion exercises: 2.5 (but choose 1 variable from world.sav, recode it, get frequency distributions for the variable, and summarize the results), 3.4, 4.4, 4.6
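
For anyone who wants to check the ideas above outside SPSS, here is a minimal Python sketch. It is not part of the recitation or of Healey's text. The function names describe and abortion_scale are hypothetical, and using None to stand for a missing answer is an assumption made for illustration. The sketch uses the two AGE distributions from item 6 to show that equal means can hide very different spreads, and it mimics the abany + abhlth sum from item 8.

```python
from statistics import mean, median, stdev


def describe(values):
    """Summarize a list of scores with the measures discussed in the handout."""
    m = mean(values)
    deviations = [x - m for x in values]  # (X_i - mean) for every score
    return {
        "mean": m,
        "median": median(values),
        "sum_of_deviations": sum(deviations),  # deviations cancel out around the mean
        "std_dev": stdev(values),  # sample formula, n - 1 in the denominator
    }


def abortion_scale(abany, abhlth):
    """Sum the two items (1 = in favor, 2 = opposed); None marks a missing answer."""
    if abany is None or abhlth is None:
        # Assumption: the scale is treated as missing when either item is missing.
        return None
    return abany + abhlth


if __name__ == "__main__":
    print(describe([50, 50, 50, 50, 50]))
    print(describe([10, 20, 50, 80, 90]))
    print(abortion_scale(1, 2))
```

Both AGE lists have a mean of 50, but the first has a standard deviation of 0 while the second has a standard deviation of about 35.4 under the sample formula. Treating the scale as missing when either item is missing is consistent with the 855 valid cases reported for the Abortion Scale, which is fewer than for either item alone.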