Quantitative Data Analysis
Chapter 8
What is Quantitative Analysis?
Quantitative analysis is a scientific approach to answering questions.
Raw data are processed and manipulated, resulting in meaningful information.
Raw Data → Quantitative Analysis → Meaningful Information
Quantitative data analysis
Making sense of the numbers for meaningful interpretation.
It involves:
1. Organizing the data
2. Doing the calculations
3. Interpreting the information
4. Explaining limitations
What are the Options for Summarizing Distributions?
• Measures of Central Tendency:
  • Mode
  • Median
  • Mean
• Measures of Variation:
  • Range
  • Interquartile range
  • Variance
  • Standard deviation
The Mode
The most frequent value in a distribution.
[Figure: bar chart, "Respondent's Religious Preference (GSS94)". Horizontal axis: R's religious preference (Protestant, Catholic, Jewish, None, Other). Vertical axis: count.]
In a distribution of Americans’ religious affiliations,
Protestant Christian is the most frequently occurring
value, that is, the largest single group.
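Finding the mode amounts to counting how often each value occurs and picking the largest count. A minimal sketch, assuming a small list of responses that is invented here for illustration (it is not the GSS94 data, whose counts appear only in the chart):

```python
from collections import Counter

# Hypothetical responses, invented for illustration only.
responses = ["Protestant", "Catholic", "Protestant", "None",
             "Jewish", "Protestant", "Catholic", "Other"]

# Count how many times each category occurs.
counts = Counter(responses)

# most_common(1) returns a list holding the single most frequent (value, count) pair.
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)
```

The mode works for nominal data like this because it only requires counting categories, not ordering them or doing arithmetic on them.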
The Median
The position average, or the point that divides
the distribution in half (the 50th percentile).
HIGHEST YEAR OF SCHOOL COMPLETED

                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid     0               4       0.1             0.1                  0.1
          1               1       0.0             0.0                  0.2
          2               3       0.1             0.1                  0.3
          3               6       0.2             0.2                  0.5
          4              12       0.4             0.4                  0.9
          5              15       0.5             0.5                  1.4
          6              19       0.6             0.6                  2.0
          7              29       1.0             1.0                  3.0
          8             109       3.6             3.6                  6.6
          9              85       2.8             2.8                  9.5
          10            102       3.4             3.4                 12.9
          11            168       5.6             5.6                 18.5
          12            929      31.0            31.1                 49.6
          13            277       9.3             9.3                 58.9
          14            321      10.7            10.7                 69.6
          15            146       4.9             4.9                 74.5
          16            433      14.5            14.5                 89.0
          17             97       3.2             3.2                 92.2
          18            119       4.0             4.0                 96.2
          19             46       1.5             1.5                 97.8
          20             64       2.1             2.1                 99.9
          DK              3       0.1             0.1                100.0
          Total        2988      99.9           100.0
Missing   NAP             4       0.1
Total                  2992     100.0
The median in a frequency distribution is determined by identifying the value corresponding to a cumulative percentage of 50.
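The cumulative-percentage rule can be sketched in a few lines. The frequencies below are copied from the years-of-schooling table. The function name and the choice to count the three "DK" responses in the denominator (as the table's valid percentages do) are assumptions of this sketch.

```python
# Frequencies for 0 through 20 years of school, copied from the table above.
freq = [4, 1, 3, 6, 12, 15, 19, 29, 109, 85, 102, 168, 929,
        277, 321, 146, 433, 97, 119, 46, 64]
DK = 3  # "don't know" responses, which the table counts as valid cases


def median_from_frequencies(freq, extra_valid=0):
    """Return the first value whose cumulative percentage reaches 50."""
    total = sum(freq) + extra_valid
    running = 0
    for value, count in enumerate(freq):
        running += count
        if 100 * running / total >= 50:
            return value
    return None


median_years = median_from_frequencies(freq, DK)
print(median_years)
```

In this table, 12 years reaches a cumulative 49.6 percent, just short of half, so the 50 percent mark falls in the next category, 13 years.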
The Mean
The mean is just the arithmetic average.
Mean = Sum of the values of all cases / Number of cases
The Mean, cont’d
For example, to calculate the mean value of eight cases, we
add the values of all the cases and divide by the number of
cases (N):
(28 + 117 + 42 + 10 + 77 + 51 + 64 + 55) /8 = 444/8 = 55.5
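The same arithmetic can be checked in code. This is a small sketch that uses the eight values from the example and compares the hand calculation with Python's standard-library `statistics.mean`:

```python
from statistics import mean

# The eight cases from the example above.
cases = [28, 117, 42, 10, 77, 51, 64, 55]

total = sum(cases)         # sum of the values of all cases
n = len(cases)             # number of cases (N)
mean_by_hand = total / n   # sum divided by N

# The library function should agree with the hand calculation.
assert mean_by_hand == mean(cases)
print(total, n, mean_by_hand)
```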
Measures of Variation
It is important to know that the median household income in the United States is a bit over $40,000 a year, but we also need to know the variation in income: the fact that incomes range from zero up to hundreds of millions of dollars.
Measures of variation capture how widely or densely spread income (for instance) is.
Measures of Variation
• Four popular measures of variation for quantitative variables are the range, the interquartile range, the variance, and the standard deviation (which is the single most popular measure of variability).
The Range
The simplest measure of variation
Calculated as the highest value in a distribution
minus the lowest value, plus 1:
Range = Highest value – Lowest value + 1
It often is important to report the range of a
distribution, to identify the whole range of
possible values that might be encountered.
The Range, cont’d.
Say that you surveyed 10 people and asked them how many times they saw the movie Star Wars, and their answers looked like this:
Number of times respondent saw Star Wars: 0, 2, 2, 3, 4, 4, 5, 20, 2, 1
The range for “times respondent saw Star Wars” is 20 – 0 + 1 = 21.
However, since the range can be drastically altered by just one exceptionally high or low value (termed an outlier), it’s not a good summary measure for most purposes.
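The range calculation for the Star Wars responses can be sketched as follows. The function name `inclusive_range` is made up for this illustration. It follows the "plus 1" definition used here, whereas many texts define the range simply as highest minus lowest.

```python
# The ten responses from the Star Wars example.
star_wars = [0, 2, 2, 3, 4, 4, 5, 20, 2, 1]


def inclusive_range(values):
    """Highest value minus lowest value, plus 1 (the definition used here)."""
    return max(values) - min(values) + 1


range_all = inclusive_range(star_wars)

# Dropping the single outlier (20) shows how sensitive the range is to one case.
without_outlier = [v for v in star_wars if v != 20]
range_trimmed = inclusive_range(without_outlier)

print(range_all, range_trimmed)
```

Removing one respondent shrinks the range from 21 to 6, which is why the range alone is a weak summary when outliers are present.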
Interquartile Range
The interquartile range avoids the
problem created by outliers, by showing
the range where most cases lie.
Quartiles are the points in a distribution
corresponding to the first 25% of the
cases, the first 50% of the cases, and the
first 75% of the cases.
Interquartile Range
• In the Star Wars example (number of times respondents saw Star Wars):
• The first quartile of cases falls between 0 and 1.75 times.
• The second quartile falls between 1.75 and 2.5 times.
• The third quartile falls between 2.5 and 4.25 times.
• The last quartile falls between 4.25 and 20 times.
Interquartile Range, cont’d
The interquartile range is the difference
between the first quartile and the third
quartile (plus 1).
In our Star Wars example, the interquartile
range is
4.25 – 1.75 + 1 = 3.50.
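The quartile cut points quoted for the Star Wars data can be reproduced with Python's `statistics.quantiles`. This sketch assumes the "exclusive" interpolation method, which happens to match the figures given here. Other quartile conventions exist and give slightly different cut points for small samples.

```python
from statistics import quantiles

# The ten responses from the Star Wars example.
star_wars = [0, 2, 2, 3, 4, 4, 5, 20, 2, 1]

# n=4 asks for the three cut points that split the data into quartiles.
q1, q2, q3 = quantiles(star_wars, n=4, method="exclusive")

# Interquartile range as defined here: third quartile minus first quartile, plus 1.
iqr = q3 - q1 + 1

print(q1, q2, q3, iqr)
```

The outlying value of 20 sits in the top quartile and has no effect on the result, which is the point of using the interquartile range.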
The Variance
By statistical definition, the variance is the average
squared deviation of each case from
the mean:
•You take each case’s distance from
the mean,
•square that number,
•and take the average of all such
numbers.
Variance
Takes into account the amount by
which each case differs from the mean.
It is affected by outliers, such as the
person who saw Star Wars 20 times.
Mainly useful for computing the
standard deviation, which comes next.
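The three steps for the variance (distance from the mean, square it, average the squares) can be traced with the Star Wars responses. This is a sketch with variable names chosen for illustration. It divides by N, matching the definition given here.

```python
# The ten responses from the Star Wars example.
star_wars = [0, 2, 2, 3, 4, 4, 5, 20, 2, 1]

n = len(star_wars)
mean_value = sum(star_wars) / n

# Step 1: each case's distance from the mean.
deviations = [v - mean_value for v in star_wars]

# Step 2: square each distance.
squared = [d ** 2 for d in deviations]

# Step 3: average the squared distances.
variance = sum(squared) / n

# Share of the total squared deviation contributed by the person who answered 20.
outlier_share = max(squared) / sum(squared)

print(mean_value, round(variance, 2), round(outlier_share, 2))
```

The single respondent who saw the film 20 times accounts for more than four fifths of the total squared deviation, which illustrates how strongly outliers affect the variance.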
The Standard Deviation
The standard deviation is simply the square root of
the variance. It is the square root of the average
squared deviation of each case from the mean:
$$ s = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{N}} $$
Symbol key: $\bar{Y}$ = mean; $N$ = number of cases; $\sum$ = sum over all cases; $Y_i$ = value of case $i$ on variable $Y$; $\sqrt{\ }$ = square root.
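As a rough check of the formula, the sketch below computes the standard deviation of the Star Wars responses by hand and with the standard library. `statistics.pstdev` divides by N, as the formula here does. `statistics.stdev` divides by N - 1 and is shown only for contrast, since many software packages report that version by default.

```python
import math
from statistics import pstdev, stdev

# The ten responses from the Star Wars example.
star_wars = [0, 2, 2, 3, 4, 4, 5, 20, 2, 1]

n = len(star_wars)
y_bar = sum(star_wars) / n

# Square root of the average squared deviation from the mean (divide by N).
sd_by_hand = math.sqrt(sum((y - y_bar) ** 2 for y in star_wars) / n)

sd_population = pstdev(star_wars)  # same formula, divides by N
sd_sample = stdev(star_wars)       # divides by N - 1, so it is slightly larger

print(round(sd_by_hand, 2), round(sd_population, 2), round(sd_sample, 2))
```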
Standard Deviation
Standard deviation has mathematical properties
that make it the preferred measure of variability in
many cases, particularly when a variable is
normally distributed.
[Figure: histogram of scores with a normal curve overlaid. Horizontal axis: Scores. Mean = 75.0, Std. Dev = 12.67, N = 25.]
A graph of a normal distribution looks like a bell, with one
“hump” in the middle, centered around the population
mean, and the number of cases tapering off on both sides
of the mean.
Normal Distribution
A normal distribution is symmetric: If you folded it in half at its center
(at the population mean), the two halves would match perfectly.
[Figure: histogram of scores with a normal curve overlaid. Horizontal axis: Scores. Mean = 75.0, Std. Dev = 12.67, N = 25.]
If a variable is normally distributed, about 68% of the cases (roughly 2/3) will lie within plus or minus 1 standard deviation of the distribution’s mean, and 95% of the cases will lie within 1.96 standard deviations above and below the mean.
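The 68% and 95% figures can be checked against a theoretical normal curve. This sketch uses `statistics.NormalDist` with the mean and standard deviation shown in the scores histogram (75.0 and 12.67). The helper name `share_within` is made up for this illustration.

```python
from statistics import NormalDist

# A normal curve with the mean and standard deviation from the scores histogram.
scores = NormalDist(mu=75.0, sigma=12.67)


def share_within(dist, k):
    """Proportion of a normal distribution lying within k standard deviations of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


within_1 = share_within(scores, 1)
within_196 = share_within(scores, 1.96)

print(round(within_1, 4), round(within_196, 4))
```

These proportions hold for any normal distribution, whatever its mean and standard deviation. An actual sample of 25 scores will only approximate them.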
Normal Distribution
• The normal curve is a tool a statistician can use to tell how far the sample is likely to be off from the overall population, i.e., how big a "margin of error" there is likely to be in their poll.
Different Statistics for Different Data

                       Nominal   Ordinal   Interval/Ratio
Mode                      X         X            X
Median                              X            X
Mean                                X            X
Range                               X            X
Interquartile Range                              X
Variance                                         X
Standard Deviation                               X
Relationships between variables
Crosstabulations (cross tabs) display the distribution of one variable for each category of another variable.
Also known as a bivariate distribution.
Cross tabs are presented first with frequencies and then with percentages.
Crosstabulation of Voting in 2000 by Family Income: Cell Counts and Percentages

FAMILY INCOME: CELL COUNTS
Voting          <$20,000   $20,000-$34,999   $35,000-$59,999   $60,000+
Voted                178               239               364        761
Did not vote         182               135               168        193
Total (n)          (360)             (374)             (532)      (954)

FAMILY INCOME: PERCENTAGES
Voting          <$20,000   $20,000-$34,999   $35,000-$59,999   $60,000+
Voted                49%               64%               68%        80%
Did not vote         51%               36%               32%        20%
Total               100%              100%              100%       100%

Source: General Social Survey, 2004. Weighted.
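The percentage table is obtained from the cell counts by dividing each cell by its column total. A minimal sketch using the counts from the voting-by-income cross tab. The function name and the rounding to whole percentages are choices made for this illustration.

```python
# Cell counts from the voting-by-family-income cross tab.
income_groups = ["<$20,000", "$20,000-$34,999", "$35,000-$59,999", "$60,000+"]
voted = [178, 239, 364, 761]
did_not_vote = [182, 135, 168, 193]


def column_percentages(voted, did_not_vote):
    """Return (percent voted, percent did not vote, column total) for each income group."""
    rows = []
    for v, d in zip(voted, did_not_vote):
        total = v + d
        rows.append((round(100 * v / total), round(100 * d / total), total))
    return rows


percentages = column_percentages(voted, did_not_vote)
for group, (pct_voted, pct_not, total) in zip(income_groups, percentages):
    print(f"{group}: voted {pct_voted}%, did not vote {pct_not}%, n = {total}")
```

Percentaging within each income group, rather than across the whole table, is what makes the groups comparable even though their sizes differ.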
Summary statistics describe particular
features of a distribution and facilitate
comparison among distributions.
The next step is to test for
associations . . .
Which calculation do I use? It depends upon what you want to know.
• Do you want to know how many individuals checked each answer? Use: Frequency
• Do you want the proportion of people who answered in a certain way? Use: Percentage
• Do you want the average number or average score? Use: Mean
• Do you want the middle value in a range of values or scores? Use: Median
• Do you want to show the range in answers or scores? Use: Range
• Do you want to compare one group to another? Use: Cross tab
• Do you want to report changes from pre to post? Use: Change score
• Do you want to show the degree to which a response varies from the mean? Use: Standard deviation