Recitation, Week 3: Basic Descriptive Statistics and Measures of Central Tendency:
1. What does Healey mean by “data reduction”?
a. Data reduction involves using a few numbers to summarize the
distribution of a variable, or an array of data as he calls it.
2. What is the problem with using only a few numbers to summarize the distribution
of a variable?
a. Summarizing a distribution involves using the mean, denoted $\bar{X}$, or the
standard deviation, denoted $\sigma$, to describe the variable. This inevitably
leads to a loss of information (precision and detail).
3. When analyzing descriptive statistics, it is best to describe the data in terms of
percentages as opposed to using the frequency count. Comparisons are difficult to
conceptualize as raw frequencies.
a. EXAMPLE: Instead of saying 20 out of 100 students got 4. on the exam,
say 20% of students got 4. on the exam.
4. What is the difference between percentage and proportion? A percentage is a
proportion multiplied by 100.
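The relationship between a frequency count, a proportion, and a percentage can be sketched in a few lines of Python. This is only an illustration using the 20-out-of-100 figure from the example in item 3; the function names are made up for this sketch.

```python
def proportion(count, total):
    """Share of cases in a category, expressed as a value between 0 and 1."""
    return count / total


def percentage(count, total):
    """A proportion multiplied by 100."""
    return proportion(count, total) * 100


# 20 out of 100 students, as in the example in item 3.
p = proportion(20, 100)
pct = percentage(20, 100)
print(f"proportion = {p:.2f}, percentage = {pct:.0f}%")
```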
5. What is a measure of central tendency?
a. It is a way to summarize the distribution to give you an idea about the
typical case of that distribution, in other words, the center of it.
b. There are three measures of central tendency
i. The mean: describes the typical score
ii. The mode: describes the most recurring score
1. The only measure of central tendency that can be used with
nominal variables
iii. The median: is the 50th Percentile of the distribution
1. A median is a special case of a percentile, which is the
score below which a specific percentage of
cases fall.
c. How does the median differ from the mode and the mean? Unlike the
mode or the mean, the median always represents the exact center of a
distribution of scores, meaning that 50% of the cases always fall above the
median and 50% of the cases always fall below the median.
d. Characteristics of the mean
i. The mean is the balance point of any distribution: it is
the point around which all of the scores cancel out.
Mathematically, this says that if I subtract the mean from each
value and sum the results, the resulting sum will be equal to 0.
ii. The mean may often be very misleading because it is sensitive
to all observations whereas the median is not. In fact, the
median is less sensitive to extreme observations and therefore it is
often “better” to report the median.
1. To illustrate this, consider the familiar normal or “bell”
curve. This is a symmetric distribution because there are as
many values on the left as there are on the right of the
center. Many natural phenomena have normal distributions,
such as weight, height, etc.
2. There are important distributions that are not symmetric.
When a distribution is not symmetric, it is skewed. There
are two types of skewed distributions, right skewed and left
skewed.
3. EXAMPLE of RIGHT SKEWED: Income. Often it is
better to report the median than the mean, since the mean is
misleading in extreme cases.
a. EXAMPLE. Consider the following summary of
AGE. Notice that the arithmetic mean is somewhat
greater than the median. The reason is that the
distribution is right skewed. If the mean is larger
than the median the distribution is __________
skewed.
Statistics: AGE OF RESPONDENT
N Valid      1385
N Missing       2
Mean        44.94
Median      41.00
To see this, create a histogram of the age variable.
[Histogram of AGE OF RESPONDENT: x-axis from 20.0 to 90.0 in steps of 5.0,
y-axis frequency from 0 to 300; Std. Dev = 17.08, Mean = 44.9, N = 1385.00]
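The points in item 5 can be checked on a small right-skewed data set. The sketch below uses Python's standard `statistics` module; the income values are made up for illustration and are not GSS data.

```python
import statistics

# Hypothetical incomes in thousands of dollars (made-up values, not GSS data).
incomes = [20, 22, 25, 25, 30, 35, 200]

mean = statistics.mean(incomes)
median = statistics.median(incomes)
mode = statistics.mode(incomes)

# Deviations from the mean cancel out (up to floating-point rounding).
deviation_sum = sum(x - mean for x in incomes)

print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")
print(f"sum of deviations from the mean = {deviation_sum:.10f}")

# Dropping the one extreme value moves the mean a long way; the median barely moves.
trimmed = incomes[:-1]
print(f"without the extreme value: mean = {statistics.mean(trimmed):.2f}, "
      f"median = {statistics.median(trimmed)}")
```

Here the single large value pulls the mean well above the median, which is the pattern described for right-skewed variables such as income.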
6. What is a “measure of dispersion?”
a. Measures of central tendency don’t tell us anything about how much the
data values differ from each other.
i. EXAMPLE: What is the mean of the following two distributions
of AGE?
1. 50 50 50 50 50
2. 10 20 50 80 90
ii. The distributions are obviously very different.
b. Measures of dispersion or variability attempt to quantify the spread of
observations.
c. It is a measure of variability, usually defined in terms of variability around
the mean.
d. A deviation is the distance between an individual score and the mean
value; mathematically, this is $(X_i - \bar{X})$.
e. The larger the distance from the mean, the larger the deviation will be.
f. The more tightly the scores cluster around the mean, the less variability
there will be.
i. PRACTICAL EXAMPLE: Let’s assume that average income for
people with PhDs is $55,000 and average income for people with
a high school education is $20,000. Since people with only a HS
education have fewer opportunities than those with PhDs, most
of them would make somewhere around $20K, so there is not much
variation. However, it is possible for PhDs to make anywhere
from $20K to $800K per year, and hence there is much more
variation around the average salary for PhDs than there is for
HS graduates.
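To make the contrast in item 6 concrete, the sketch below computes the mean, range, and standard deviation of the two AGE distributions from the example in 6.a. It uses the sample standard deviation (dividing by n − 1), which is an assumption of this sketch; the helper name `describe` is made up for illustration.

```python
import statistics

# The two AGE distributions from the example in 6.a.
same_ages = [50, 50, 50, 50, 50]
spread_ages = [10, 20, 50, 80, 90]


def describe(ages):
    """Return the mean, range, and sample standard deviation of a list of scores."""
    return {
        "mean": statistics.mean(ages),
        "range": max(ages) - min(ages),
        # statistics.stdev divides by n - 1 (sample standard deviation).
        "std_dev": statistics.stdev(ages),
    }


same_summary = describe(same_ages)
spread_summary = describe(spread_ages)
print("same:  ", same_summary)
print("spread:", spread_summary)
```

Both distributions have the same mean, so only the range and the standard deviation reveal how different they are.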
7. USING SPSS to Produce Measures of Dispersion
a. Use Descriptives to find the range and standard deviation for age,
educ and tvhours
b. To reproduce this output, first open gss98randsamp.sav
c. Go to the familiar Analyze → Descriptive Statistics → Frequencies
d. Put the variables corresponding to age, educ and tvhours in the box
labeled “Variable(s)”
e. Click the “Statistics” button and check mean, minimum, maximum
and standard deviation
f. Click continue, then OK
g. Notice that the chart in the book looks a little bit different, so let’s
transpose the rows and columns to make it look like Healey.
h. Double click on the output window
i. Then from the menu select Pivot → Transpose Rows and Columns;
you should get what is shown below.
Statistics
                                  N Valid  N Missing   Mean  Std. Deviation  Minimum  Maximum
AGE OF RESPONDENT                    1385          2  44.94          17.080       18       89
HIGHEST YEAR OF SCHOOL COMPLETED     1381          6  13.37           2.857        0       20
HOURS PER DAY WATCHING TV            1134        253   2.86           2.197        0       21
j. How do we interpret these results? What is 1 standard deviation
above the mean for the variable tvhours?
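The arithmetic behind question j can be sketched as follows, using the mean and standard deviation reported for tvhours in the output above. The variable names are made up for this sketch.

```python
# Values copied from the SPSS output for HOURS PER DAY WATCHING TV.
mean_tvhours = 2.86
sd_tvhours = 2.197

# One standard deviation above and below the mean.
one_sd_above = mean_tvhours + sd_tvhours
one_sd_below = mean_tvhours - sd_tvhours

print(f"one SD above the mean: {one_sd_above:.3f} hours per day")
print(f"one SD below the mean: {one_sd_below:.3f} hours per day")
```

A respondent one standard deviation above the mean watches a little over five hours of TV per day, while one standard deviation below the mean is well under one hour.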
8. USING THE COMPUTE COMMAND to create an Attitude towards
abortion scale
a. There are two distinct measures of attitudes toward abortion in the 1998
GSS survey. One variable, abany, asks the respondent to state whether
they believe that abortion should be allowed for any reason. The other,
abhlth, measures whether they feel abortion should be allowed when the
health of the woman is seriously endangered.
b. We want to create a summary measure that gives us an overall measure of
attitude toward abortion.
c. We must know something about the data. If the response on abany is 1,
then the person was in favor of abortion for any reason. If the value in the
dataset is 2, then the person was opposed. Similarly for abhlth: 1 = in
favor of abortion if health is at stake, 2 = not in favor.
d. We want an overall measure of anti-abortion position, so we will sum the
variables. If the person was in favor of both, then our new variable will
have a value of 2 (1+1). In favor of one and not the other gives a value of
3 (2+1 or 1+2). If a person is completely against abortion, the value is
going to be 4 (2+2).
e. We need to use the Compute command. To open the Compute Variable
dialog box, from the menus choose: Transform → Compute
f. In the Compute Variable dialog box, type abscale in the Target Variable
box; this is the name of the variable we are creating.
g. Click the button Type and Label and type “Abortion Scale”, then Continue
h. Select (or type) abany in the variable list and move it into the Numeric
Expression box. Then type + and then abhlth and OK.
i. Get the frequency distribution of each variable.
Statistics
                                        N Valid  N Missing    Mean  Std. Deviation  Minimum  Maximum
ABORTION IF WOMAN WANTS FOR ANY REASON      887        500    1.58            .494        1        2
WOMANS HEALTH SERIOUSLY ENDANGERED          895        492    1.12            .321        1        2
Abortion Scale                              855        532  2.6865          .67404     2.00     4.00
ABORTION IF WOMAN WANTS FOR ANY REASON
                  Frequency  Percent  Valid Percent  Cumulative Percent
Valid    YES            372     26.8           41.9                41.9
         NO             515     37.1           58.1               100.0
         Total          887     64.0          100.0
Missing  NAP            449     32.4
         DK              49      3.5
         NA               2       .1
         Total          500     36.0
Total                  1387    100.0
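The Percent and Valid Percent columns in this table differ only in their denominators: Percent divides by all 1387 respondents, while Valid Percent divides by the 887 who gave a yes or no answer. A minimal sketch of that arithmetic, using the frequencies from the table (the variable names are made up for illustration):

```python
# Frequencies copied from the abany table.
valid_counts = {"YES": 372, "NO": 515}
missing_counts = {"NAP": 449, "DK": 49, "NA": 2}

valid_total = sum(valid_counts.values())                  # respondents with a yes/no answer
grand_total = valid_total + sum(missing_counts.values())  # all respondents

rows = {}
for label, count in valid_counts.items():
    percent = 100 * count / grand_total        # share of all respondents
    valid_percent = 100 * count / valid_total  # share of valid responses only
    rows[label] = (round(percent, 1), round(valid_percent, 1))
    print(f"{label}: Percent = {percent:.1f}, Valid Percent = {valid_percent:.1f}")
```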
WOMANS HEALTH SERIOUSLY ENDANGERED
                  Frequency  Percent  Valid Percent  Cumulative Percent
Valid    YES            791     57.0           88.4                88.4
         NO             104      7.5           11.6               100.0
         Total          895     64.5          100.0
Missing  NAP            449     32.4
         DK              42      3.0
         NA               1       .1
         Total          492     35.5
Total                  1387    100.0
Abortion Scale
                  Frequency  Percent  Valid Percent  Cumulative Percent
Valid    2.00           370     26.7           43.3                43.3
         3.00           383     27.6           44.8                88.1
         4.00           102      7.4           11.9               100.0
         Total          855     61.6          100.0
Missing  System         532     38.4
Total                  1387    100.0
Note: for category 3.00, we know these are the situations where the person approved in
one situation but not in the other, but we do not know which situation they approved. It
seems reasonable that they approved when the health of the mother was at stake but not
for any reason, but we would have to use other procedures to find that out.
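The logic of the abscale computation in item 8 can be sketched outside SPSS as well. The Python below is only an illustration: the respondents are made-up values rather than GSS records, the function name is hypothetical, and treating a sum as missing whenever either item is missing is an assumption meant to mirror the System-missing row in the Abortion Scale table.

```python
from collections import Counter


def abortion_scale(abany, abhlth):
    """Sum the two items (1 = yes, 2 = no); return None if either is missing."""
    if abany is None or abhlth is None:
        return None
    return abany + abhlth


# Hypothetical (abany, abhlth) pairs; None stands for a missing response.
respondents = [(1, 1), (2, 1), (1, 2), (2, 2), (None, 1), (2, None)]

scores = [abortion_scale(a, h) for a, h in respondents]
valid_scores = [s for s in scores if s is not None]
counts = Counter(valid_scores)

print("scale scores:", scores)
print("frequency of each valid score:", dict(sorted(counts.items())))
print("missing:", len(scores) - len(valid_scores))
```

A score of 3 can come from either (2, 1) or (1, 2), which is why the scale alone cannot say which situation the respondent approved.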
SPSS companion exercises
2.5, but choose 1 variable from world.sav, recode it, get frequency distributions for the
variable, and summarize the results.
3.4
4.4
4.6