Quantitative Data Analysis
Chapter 8
What is Quantitative Analysis?
Quantitative analysis is a scientific approach to
answering questions
Raw data are processed and manipulated
resulting in meaningful information
Raw Data → Quantitative Analysis → Meaningful Information
Quantitative data analysis
Making sense of the numbers for meaningful
interpretation
It involves:
1. Organizing the data
2. Doing the calculations
3. Interpreting the information
4. Explaining limitations
What are the Options for Summarizing
Distributions?
• Measures of Central Tendency:
• Mode
• Median
• Mean
What are the Options for Summarizing
Distributions?
• Measures of Variation:
• Range
• Interquartile range
• Variance
• Standard deviation
The Mode
The most frequent value in a distribution.
[Bar chart: Respondent's Religious Preference (GSS94). Vertical axis: Count (0 to 2000). Horizontal axis: R's religious preference (Protestant, Catholic, Jewish, None, Other).]
In a distribution of Americans’ religious affiliations,
Protestant Christian is the most frequently occurring
value: the largest single group.
The Median
The point that divides the distribution in half (the
50th percentile).
HIGHEST YEAR OF SCHOOL COMPLETED

                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid     0              4        0.1         0.1               0.1
          1              1        0.0         0.0               0.2
          2              3        0.1         0.1               0.3
          3              6        0.2         0.2               0.5
          4             12        0.4         0.4               0.9
          5             15        0.5         0.5               1.4
          6             19        0.6         0.6               2.0
          7             29        1.0         1.0               3.0
          8            109        3.6         3.6               6.6
          9             85        2.8         2.8               9.5
          10           102        3.4         3.4              12.9
          11           168        5.6         5.6              18.5
          12           929       31.0        31.1              49.6
          13           277        9.3         9.3              58.9
          14           321       10.7        10.7              69.6
          15           146        4.9         4.9              74.5
          16           433       14.5        14.5              89.0
          17            97        3.2         3.2              92.2
          18           119        4.0         4.0              96.2
          19            46        1.5         1.5              97.8
          20            64        2.1         2.1              99.9
          DK             3        0.1         0.1             100.0
          Total       2988       99.9       100.0
Missing   NAP            4        0.1
Total                 2992      100.0
The median in a
frequency distribution
is determined by
identifying the value
corresponding to a
cumulative percentage
of 50.
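As a rough sketch of this rule, the Python snippet below walks through the schooling frequencies and stops at the first value whose cumulative percentage reaches 50. It also picks out the mode as the value with the largest frequency. The frequencies are those listed for 0 through 20 years. Leaving out the 3 "DK" answers, and the function name `median_from_frequencies`, are assumptions made for this illustration.

```python
# Frequencies for highest year of school completed (0-20 years), from the table.
# The 3 "DK" answers are left out here (an assumption for this sketch).
freq = {
    0: 4, 1: 1, 2: 3, 3: 6, 4: 12, 5: 15, 6: 19, 7: 29, 8: 109, 9: 85,
    10: 102, 11: 168, 12: 929, 13: 277, 14: 321, 15: 146, 16: 433,
    17: 97, 18: 119, 19: 46, 20: 64,
}


def median_from_frequencies(freq):
    """Return the first value whose cumulative percentage reaches 50."""
    total = sum(freq.values())
    running = 0
    for value in sorted(freq):
        running += freq[value]
        if 100 * running / total >= 50:
            return value


mode = max(freq, key=freq.get)          # value with the largest frequency
median = median_from_frequencies(freq)
print("mode:", mode, "median:", median)
```

With these frequencies the cumulative percentage is still just under 50 through 12 years and passes 50 at 13 years, so the sketch reports a median of 13, while the mode is 12.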
The Mean
The mean is just the arithmetic average.
Mean = Sum of the values of all cases / Number of cases
The Mean, cont’d
For example, to calculate the mean value of eight
cases, add the values of all cases and divide by the
number of cases (N):
(28 + 117 + 42 + 10 + 77 + 51 + 64 + 55) /8 = 444/8 = 55.5
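The same arithmetic can be checked in a couple of lines of Python. This is only a sketch of the calculation above; the variable names are made up for the example.

```python
# The eight case values from the example above.
values = [28, 117, 42, 10, 77, 51, 64, 55]

total = sum(values)            # sum of the values of all cases
mean = total / len(values)     # divide by the number of cases (N)
print(total, mean)             # 444 55.5
```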
Measures of Variation
It is important to know that the median
household income in the United States is a
bit over $40,000 a year, but we also need to
know the variation in income: incomes range
from zero up to hundreds of millions of dollars.
Measures of variation capture how widely or
densely spread income (for instance) is.
Measures of Variation
• Four measures of variation for
quantitative variables:
1. Range
2. Interquartile range
3. Variance
4. Standard deviation
The Range
Simplest measure of variation
Calculated as highest value in a distribution
minus lowest value, plus 1:
Range = Highest value – Lowest value + 1
It often is important to report the range of a
distribution, to identify the whole range of
possible values that might be encountered.
The Range, cont’d.
Say that you surveyed 10 people and
asked them how many times they saw
the movie Star Wars, and their answers
looked like this:
Number of times respondent saw Star Wars: 0, 2, 2, 3, 4, 4, 5, 20, 2, 1
The range for “times respondent
saw Star Wars” is 20 – 0 + 1 = 21.
Because the range can be drastically altered by
one exceptionally high or low value
(termed an outlier), it is not a good
summary measure for most purposes.
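A minimal Python sketch of the range calculation, using the ten Star Wars answers and the "plus 1" convention from these slides. The function name `inclusive_range` is made up for the example. Dropping the one respondent who answered 20 shows how strongly a single outlier moves the range.

```python
# Number of times each of the 10 respondents saw Star Wars.
times_seen = [0, 2, 2, 3, 4, 4, 5, 20, 2, 1]


def inclusive_range(values):
    """Highest value minus lowest value, plus 1 (the convention used here)."""
    return max(values) - min(values) + 1


full_range = inclusive_range(times_seen)
# Same calculation with the outlier (20) removed.
trimmed_range = inclusive_range([v for v in times_seen if v != 20])
print(full_range, trimmed_range)   # 21 6
```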
Interquartile Range
The interquartile range avoids the problem
created by outliers by showing the range
where most cases lie.
Quartiles are points in a distribution
corresponding to the first 25% of cases, the
first 50% of cases, and first 75% of cases.
Interquartile Range
• Star Wars example: Number of times
respondents saw Star Wars.
• The first quartile of cases falls within the
range of 0 and 1.75 times.
• The second quartile falls within the range of
1.75 and 2.5 times.
• The third quartile falls within 2.5 and 4.25
times.
• The last quartile is between 4.25 and 20
times.
Interquartile Range, cont’d
The interquartile range is the difference between
the first quartile and the third quartile (plus 1).
In the Star Wars example, the interquartile range is
4.25 – 1.75 + 1 = 3.50
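The quartile cut points can be reproduced with Python's standard library. Quartile conventions differ between software packages. The values 1.75, 2.5, and 4.25 happen to match the "exclusive" method of `statistics.quantiles`, but that this is the method behind the slide is an assumption. The "plus 1" follows the convention used in these slides.

```python
import statistics

# Number of times each of the 10 respondents saw Star Wars.
times_seen = [0, 2, 2, 3, 4, 4, 5, 20, 2, 1]

# Cut points at 25%, 50%, and 75% of cases (the data are sorted internally).
q1, q2, q3 = statistics.quantiles(times_seen, n=4, method="exclusive")

# Interquartile range with the "plus 1" convention used in these slides.
iqr = q3 - q1 + 1
print(q1, q2, q3, iqr)   # 1.75 2.5 4.25 3.5
```

Unlike the range of 21, this figure is unaffected by the respondent who answered 20.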
Variance
Statistical definition: The average
squared deviation of each case from
the mean:
• Take each case’s distance from the
mean,
• Square that number, and
• Take the average of all such numbers.
Variance
Takes into account the amount by
which each case differs from the mean.
It is affected by outliers, such as the
person who saw Star Wars 20 times.
Mainly useful for computing the
standard deviation, which comes next.
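The three steps of the definition can be sketched in Python with the Star Wars answers. The sketch divides by N, as the definition above does, and the function name `population_variance` is made up for the example. Running it with and without the respondent who answered 20 shows how much one outlier inflates the variance.

```python
# Number of times each of the 10 respondents saw Star Wars.
times_seen = [0, 2, 2, 3, 4, 4, 5, 20, 2, 1]


def population_variance(values):
    """Average squared deviation of each case from the mean (divides by N)."""
    mean = sum(values) / len(values)
    squared_deviations = [(y - mean) ** 2 for y in values]  # distance, squared
    return sum(squared_deviations) / len(values)            # average of squares


variance_all = population_variance(times_seen)
variance_no_outlier = population_variance([v for v in times_seen if v != 20])
print(round(variance_all, 2), round(variance_no_outlier, 2))   # 29.41 2.25
```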
Standard Deviation
The standard deviation is the square root of the
variance. It is the square root of the average squared
deviation of each case from the mean:
$$ s = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{N}} $$
Symbol key: $\bar{Y}$ = mean; $N$ = number of cases; $\sum$ =
sum over all cases; $Y_i$ = value of case $i$ on variable $Y$;
$\sqrt{\ }$ = square root.
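As an illustration, the formula can be applied to the ten Star Wars answers (0, 2, 2, 3, 4, 4, 5, 20, 2, 1) in Python. The sketch divides by N, as in the formula above, and cross-checks the result against `statistics.pstdev`, which uses the same denominator. Many statistical packages instead report the sample standard deviation, which divides by N - 1 (`statistics.stdev` in Python), so their output would be slightly larger.

```python
import math
import statistics

# Number of times each of the 10 respondents saw Star Wars.
times_seen = [0, 2, 2, 3, 4, 4, 5, 20, 2, 1]

n = len(times_seen)
mean = sum(times_seen) / n
# Square root of the average squared deviation from the mean.
s = math.sqrt(sum((y - mean) ** 2 for y in times_seen) / n)

print(round(s, 2))                             # 5.42
print(round(statistics.pstdev(times_seen), 2)) # same N-denominator formula
```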
Standard Deviation
Standard deviation has mathematical properties
that make it the preferred measure of variability,
particularly when a variable is normally distributed.
[Histogram of Scores: Mean = 75.0, Std. Dev = 12.67, N = 25]
The graph of a normal distribution looks like a bell, with one
“hump” in the middle, centered around the population mean,
and the number of cases tapering off on both sides of the
mean.
Normal Distribution
A normal distribution is symmetric: If you folded it in half at its center
(at the population mean), the two halves would match perfectly.
[Histogram of Scores: Mean = 75.0, Std. Dev = 12.67, N = 25]
If a variable is normally distributed, about 68% of the cases (roughly 2/3) lie
within 1 standard deviation of the distribution’s mean, and
95% of the cases lie within 1.96 standard deviations above or
below the mean.
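These two percentages can be checked against the standard normal curve with Python's `statistics.NormalDist`. The helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of cases within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


print(round(share_within(1.0), 4))    # 0.6827
print(round(share_within(1.96), 4))   # 0.95
```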
Normal Distribution
• The normal curve is a tool for
estimating how far a sample result is likely
to be off from the population value.
• It tells us how big a "margin of error" there
is likely to be in poll results.
Margin of error
• The price that researchers pay for not talking
to everyone in the population.
• Describes the range within which the true
answer likely falls, that is, the answer the
researcher would have obtained by talking to
everyone instead of just a sample.
• http://www.custominsight.com/articles/random-sample-calculator.asp
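The slides do not give a formula, so the following is only a sketch under stated assumptions: a simple random sample from a large population, a percentage (proportion) as the poll result, and a 95% confidence level, for which the usual approximation is 1.96 × sqrt(p(1 − p)/n). The sample sizes and the function name `margin_of_error` are hypothetical.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from a simple
    random sample of size n (assumes a large population)."""
    return z * math.sqrt(p * (1 - p) / n)


# p = 0.5 is the worst case (widest margin); sample sizes are hypothetical.
for n in (400, 1000):
    moe = margin_of_error(0.5, n)
    print(n, round(100 * moe, 1))   # margin of error in percentage points
```

Under these assumptions a sample of 1,000 gives roughly plus or minus 3 percentage points, and a smaller sample gives a wider margin.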
Different Statistics for Different Data
                       Nominal   Ordinal   Interval/Ratio
Mode                      X         X            X
Median                              X            X
Mean                                X            X
Range                               X            X
Interquartile Range                              X
Variance                                         X
Standard Deviation                               X
Relationships between variables
A crosstabulation (cross tab) displays the
distribution of one variable for each
category of another variable.
It is also known as a bivariate distribution.
Cross tabs are presented first with
frequencies and then with percentages.
Crosstabulation of Voting in 2000 by Family Income: Cell Counts and
Percentages

FAMILY INCOME: CELL COUNTS
Voting          <$20,000   $20,000-$34,999   $35,000-$59,999   $60,000+
Voted              178           239               364            761
Did not vote       182           135               168            193
Total (n)         (360)         (374)             (532)          (954)

FAMILY INCOME: PERCENTAGES
Voting          <$20,000   $20,000-$34,999   $35,000-$59,999   $60,000+
Voted              49%           64%               68%            80%
Did not vote       51%           36%               32%            20%
Total             100%          100%              100%           100%

Source: General Social Survey, 2004. Weighted.
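The percentage table follows from the cell counts: each count is divided by the total for its income category. A minimal Python sketch of that step is below. The dictionary layout and the function name `column_percentages` are choices made for this example.

```python
# Cell counts from the crosstab: voting in 2000 by family income.
counts = {
    "<$20,000": {"Voted": 178, "Did not vote": 182},
    "$20,000-$34,999": {"Voted": 239, "Did not vote": 135},
    "$35,000-$59,999": {"Voted": 364, "Did not vote": 168},
    "$60,000+": {"Voted": 761, "Did not vote": 193},
}


def column_percentages(cells):
    """Convert one income category's counts to whole-number percentages."""
    total = sum(cells.values())
    return {label: round(100 * n / total) for label, n in cells.items()}


percentages = {income: column_percentages(cells)
               for income, cells in counts.items()}
for income, pct in percentages.items():
    print(income, pct)
```

Percentaging within each income category is what makes the groups comparable even though they differ in size (360 cases versus 954).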
Summary statistics describe particular
features of a distribution and facilitate
comparison among distributions.
The next step is to test for
associations . . .
Which calculation do I use?
It depends on what you want to know.
Do you want to know how many individuals checked each answer? Frequency
Do you want the proportion of people who answered in a certain way? Percentage
Do you want the average number or average score? Mean
Do you want the middle value in a range of values or scores? Median
Do you want to show the range in answers or scores? Range
Do you want to compare one group to another? Cross tab
Do you want to report changes from pre to post? Change score
Do you want to show the degree to which a response varies from the mean? Standard deviation