Announcement


Homework #2 due next Friday at 5pm.
Midterm is in 2 weeks. It will cover everything
through the end of next week (week 5).
Political Science 15
Lecture 8:
Descriptive Statistics (Part 1)
Data Coding

Coding is the process of assigning numerical
values to the values of your variable.

The meaning of these codes will depend on the
level of measurement of the variable:
Nominal: codes are just indications of the category
Ordinal: codes are indications of ordering
Interval/Ratio: codes are the actual numerical value

Preparing Data for
Hypothesis Testing



Gather measurements on all of the concepts
important for your hypothesis (dependent,
independent, and control variables). Enter them
into a spreadsheet.
We will use SPSS in this class.
Each row is an observation (unit), each column
is a variable.
Example of Data Ready for
Hypothesis Testing
Interview #   Religion   Income   Ideology
1             1          35000    4
2             1          46000    3
3             3          82000    5
4             2          19000    2
5             1          67000    6
We use a codebook to find out what
these numbers mean.
Descriptive Statistics



Descriptive statistics can be used for descriptive
inference – using data to learn something about
the state of the world.
These descriptive statistics will also be the
building blocks we use for causal inference –
testing our hypotheses with data to learn
something about how the world works.
We begin with descriptive statistics for a single
variable.
Understanding Our Data



Before undertaking any data analysis you should
examine your data carefully.
Watch for unusual distributions of variable
values and outliers in the data.
An outlier is an extreme value on a variable. Try
to determine why you have observed this value.
An unusual case? A coding error?
Example of an outlier affecting a
relationship
Exploring Data:
Frequency Distributions
Divide the variable into a set of exhaustive,
mutually exclusive categories.
Example:
Ideology       # of people   Percent   Cumulative Percent
Conservative   300           30%       30%
Moderate       500           50%       80%
Liberal        200           20%       100%
Total          1000          100%      100%
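The percent and cumulative percent columns can be derived from the raw counts. The following is a minimal Python sketch, not the SPSS procedure used in class; the function name frequency_table is made up for illustration, and the counts are the ones from the ideology example.

```python
# Counts taken from the ideology example above.
counts = {"Conservative": 300, "Moderate": 500, "Liberal": 200}


def frequency_table(counts):
    """Return rows of (category, count, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    running = 0  # running count used for the cumulative column
    for category, n in counts.items():
        running += n
        rows.append((category, n, 100 * n / total, 100 * running / total))
    return rows


table = frequency_table(counts)
for category, n, pct, cum_pct in table:
    print(f"{category:<13}{n:>6}{pct:>8.0f}%{cum_pct:>8.0f}%")
```

The cumulative column only makes sense when the categories have a meaningful order, as they do here.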

Exploring Data:
Graphical Methods

For nominal and ordinal level data bar graphs
work well:
Exploring Data:
Graphical Methods

For interval level data a histogram is useful (note
detection of outlier):
Central Tendency: Mode




The mode is the category of a variable with the
greatest frequency of observations.
The mode is most commonly used on variables
with a nominal level of measurement.
There can be more than one modal value for a
variable. Variables with more than one mode
are referred to as bimodal or multimodal.
Example: In a party ID variable we have 40
Democrats, 60 Republicans, and 20
Independents, so the mode is “Republican.”
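As a small illustration, the mode can be found by looking for the largest count. This Python sketch uses the party ID counts from the example; the helper name modes is hypothetical, and it returns a list so that bimodal or multimodal variables are handled as well.

```python
from collections import Counter

# Party ID counts from the example above.
party_counts = Counter({"Democrat": 40, "Republican": 60, "Independent": 20})


def modes(counter):
    """Return every category that shares the highest frequency."""
    top = max(counter.values())
    return [category for category, n in counter.items() if n == top]


party_modes = modes(party_counts)
print(party_modes)
```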
Central Tendency: Median




The median is the value of a variable that divides
the observations on that variable in half.
If we ordered our observations on a variable
from lowest to highest, the median observation
is the one in the middle.
With an even number of observations there is
no single middle observation; by convention the
median is the average of the two middle values.
The median is most commonly used on
variables with an ordinal level of measurement,
but is sometimes used on interval/ratio data
because it is resistant to outliers.
Example of Calculating Median

We have a 7-point scale on ideology in a survey:
Category:     1    2    3    4    5    6    7
# responses:  32   54   97   103  44   21   12

The median observation is observation
(N+1)/2 = 182. Count up from the lowest
value: the median is 3.
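The counting-up step can be sketched in Python as follows. The function name grouped_median is an illustrative choice, and the sketch assumes an odd number of observations, as in this example (N = 363), so that a single middle observation exists.

```python
# Responses per category on the 7-point ideology scale from the example.
responses = {1: 32, 2: 54, 3: 97, 4: 103, 5: 44, 6: 21, 7: 12}


def grouped_median(freqs):
    """Find the category containing the middle observation.

    Assumes the total number of observations is odd.
    """
    n = sum(freqs.values())
    position = (n + 1) // 2  # which observation is the middle one
    running = 0
    for category in sorted(freqs):
        running += freqs[category]  # cumulative count up to this category
        if running >= position:
            return category, n, position


median_category, n_obs, median_position = grouped_median(responses)
print(n_obs, median_position, median_category)
```

Categories 1 and 2 together hold 86 observations and categories 1 through 3 hold 183, so observation 182 falls in category 3.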
Quartiles






If we arrange a variable from lowest to highest
value, the median is the observation at the 50%
mark.
Quartiles are at the 25%, 50% and 75% marks.
Quintiles: 20%, 40%, 60%, 80%
Deciles: every 10%
Percentiles: every 1%
We can use these to get a more detailed picture
of the distribution of a variable.
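As a rough illustration, Python's standard library can compute these cut points. The data below are the seven campaign-spending values used in this lecture's examples. Statistical packages use slightly different interpolation rules, so other software may report somewhat different cut points for small samples; this sketch relies on the default rule of statistics.quantiles.

```python
import statistics

# Campaign spending in 7 districts (the lecture's running example).
spending = [1000, 5000, 3500, 2000, 0, 800, 6000]

# n=4 returns the three cut points at the 25%, 50% and 75% marks.
quartiles = statistics.quantiles(spending, n=4)

# n=5 returns the four cut points at the 20%, 40%, 60% and 80% marks.
quintiles = statistics.quantiles(spending, n=5)

print("Quartiles:", quartiles)
print("Quintiles:", quintiles)
```

With this rule the quartile cut points are 800, 2000 and 5000; the middle one equals the median.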
Central Tendency: Mean



The mean is the sum of the values of a variable
divided by the number of observations on that
variable. This is usually what people mean by
“average.”
The formula for the mean is written as:
X̄ = ΣX / N
The mean is most commonly used on variables
with an interval level of measurement.
Example of Calculating Mean

We have campaign spending in 7 districts:
District:  1     2     3     4     5     6     7
$ spent:   1000  5000  3500  2000  0     800   6000


ΣX = 1000 + 5000 + 3500 + 2000 + 0 + 800 +
6000 = 18300. N = 7
The mean is 18300/7 = 2614.
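The same arithmetic in Python, as a minimal sketch using the spending figures from the example:

```python
# Campaign spending in the 7 districts from the example.
spending = [1000, 5000, 3500, 2000, 0, 800, 6000]

total = sum(spending)  # the sum of the values
n = len(spending)      # the number of observations
mean = total / n

print(total, n, round(mean))
```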
Central Tendencies in Global
Income Distribution
Dispersion: Standard Deviation




The variance of a variable is the sum of the
squared differences between each value of that
variable and the mean, divided by N – 1.
We square the differences so that positive and
negative differences don’t cancel out.
We divide by N – 1 rather than N to get a (conservative)
estimate of the mean dispersion of the variable.
The square root of the variance is the standard
deviation:
s = square root of [Σ(X – X̄)² / (N – 1)], where X̄ is the mean
Example of Calculating
Standard Deviation

We have campaign spending in 7 districts:
District:  1     2     3     4     5     6     7
$ spent:   1000  5000  3500  2000  0     800   6000
Mean of variable is 2614.
s = square root of [1/6 ((1000 – 2614)² +
(5000 – 2614)² + …)]
The standard deviation is 2275. (Dividing by N = 7
instead of N – 1 = 6 gives 2106.)
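The calculation can be checked with a short Python sketch. It computes the standard deviation both ways: dividing by N – 1, as in the definition of the variance given above, yields about 2275, while dividing by N yields about 2106. The variable names are illustrative.

```python
import math

# Campaign spending in the 7 districts from the example.
spending = [1000, 5000, 3500, 2000, 0, 800, 6000]

n = len(spending)
mean = sum(spending) / n

# Sum of squared differences between each value and the mean.
squared_devs = sum((x - mean) ** 2 for x in spending)

sample_sd = math.sqrt(squared_devs / (n - 1))  # divide by N - 1
population_sd = math.sqrt(squared_devs / n)    # divide by N

print(round(sample_sd), round(population_sd))
```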
z scores


A z score is a measure of how many standard
deviations a particular observation is above or
below the mean.
We subtract the mean from the observation and
divide by the standard deviation.
Example of Calculating
z scores

We have campaign spending in 7 districts:
District:  1     2     3     4     5     6     7
$ spent:   1000  5000  3500  2000  0     800   6000

Mean of variable is 2614.
Standard deviation of variable (dividing by N – 1) is 2275.
z score for district 1 is (1000 – 2614)/2275 = -0.71
z score for district 2 is (5000 – 2614)/2275 = 1.05
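A minimal Python sketch of the z score calculation for all seven districts follows. It uses the N – 1 standard deviation (about 2275), which gives -0.71 and 1.05 for districts 1 and 2; with the N-denominator value of 2106 the same two districts come out at -0.77 and 1.13, so both versions are computed for comparison. The helper name z_scores is hypothetical.

```python
import statistics

# Campaign spending in the 7 districts from the example.
spending = [1000, 5000, 3500, 2000, 0, 800, 6000]

mean = statistics.mean(spending)


def z_scores(values, mean, sd):
    """Subtract the mean from each observation and divide by the SD."""
    return [(x - mean) / sd for x in values]


# statistics.stdev divides by N - 1; statistics.pstdev divides by N.
z_sample = z_scores(spending, mean, statistics.stdev(spending))
z_population = z_scores(spending, mean, statistics.pstdev(spending))

for district, z in enumerate(z_sample, start=1):
    print(f"District {district}: z = {z:.2f}")
```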
Descriptive Statistics for
Relationships Between Variables



These are the more interesting descriptive
statistics from our perspective, since we are
interested in testing causal relationships between
variables.
Our hypothesis tests later in the class will usually
be based on these calculations.
As with a single variable, we begin by exploring
our data to be sure we understand it.
Exploring Data:
Bivariate Frequency Distributions


Divide the variables into a set of exhaustive,
mutually exclusive categories.
Example:
Party ID      Favors Gas Tax   Opposes Gas Tax
Democrat      50% (N=500)      10% (N=100)
Republican    10% (N=100)      30% (N=300)
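The percentages in this table are shares of the whole sample of 1000. To compare the two parties directly it often helps to compute percentages within each party instead. The sketch below does this in Python from the cell counts; the function name percent_within_party is made up for illustration.

```python
# Cell counts from the party ID by gas tax table above.
crosstab = {
    "Democrat": {"Favors": 500, "Opposes": 100},
    "Republican": {"Favors": 100, "Opposes": 300},
}


def percent_within_party(table):
    """Convert cell counts to percentages of each party's total."""
    result = {}
    for party, cells in table.items():
        party_total = sum(cells.values())
        result[party] = {
            position: 100 * n / party_total for position, n in cells.items()
        }
    return result


within = percent_within_party(crosstab)
for party, cells in within.items():
    print(f"{party}: {cells['Favors']:.1f}% favor, {cells['Opposes']:.1f}% oppose")
```

In these data about 83% of Democrats favor the gas tax compared with 25% of Republicans.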
Examples of Relationships in
Crosstabs
First table:
       Dem   Rep
Yes    25%   25%
No     25%   25%

Second table:
       Dem   Rep
Yes    40%   10%
No     10%   40%

Third table:
       Dem   Rep
Yes    10%   40%
No     40%   10%

Our hypothesis is that Democrats are more
supportive of a gas tax. Do our data support this?
First table: No. Second table: Yes. Third table: No.
Exploring Data:
Graphical Methods

For interval level data scatterplots are a good
way to examine relationships between variables:
Correlations

Correlations measure the relationship between two
interval level variables.

Correlations always fall between –1 and 1.
Positive correlations indicate a positive relationship,
negative correlations indicate a negative relationship.
No relationship gives a 0 correlation, but 0 correlation
does not necessarily mean no relationship.
Correlations only capture linear relationships:
y = a + b*x
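To illustrate why a 0 correlation does not necessarily mean no relationship, the sketch below computes the standard Pearson correlation coefficient for two made-up data sets: one that follows y = a + b*x exactly and one that follows a curve (y = x squared). The formula is not given in these slides; it is the usual textbook definition, and the function name pearson_r and the data are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the two means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


xs = [-2, -1, 0, 1, 2]
linear = [1 + 3 * x for x in xs]  # exactly y = a + b*x with a = 1, b = 3
curved = [x ** 2 for x in xs]     # y depends on x, but not linearly

r_linear = pearson_r(xs, linear)
r_curved = pearson_r(xs, curved)
print(r_linear, r_curved)
```

The linear data give a correlation of 1, while the curved data give 0 even though y is completely determined by x.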



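Positive Correlations
[Figure: scatterplots of a stronger and a weaker positive correlation]
Negative Correlations
[Figure: scatterplots of a stronger and a weaker negative correlation]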
Examples of Correlations