The Scientific Study of
Politics (POL 51)
Professor B. Jones
University of California, Davis
Fun With Numbers
 Some Univariate Statistics
 Learning to Describe Data
Useful to Visualize Data
[Figure: Histogram of Variable Y. x-axis: Variable Y (0 to 4000); y-axis: Frequency (0 to 40)]
Main Features
 Exhibits “Right Skew”
 Some “Outlying” Data Points?
 Question: Are the outlying data points also “influential”
data points (on measures of central tendency)?
 Let’s check…
The Mean
 Formally, the mean is given by:
$$\bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_N}{N}$$
 Or more compactly:
$$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}$$
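As a quick check of the formula, here is a minimal R sketch. It borrows the five example values that appear later in these slides (32, 5, 23, 99, 54) rather than the 67-observation data, which is not reproduced here; the names `y`, `mean_by_hand`, and `mean_builtin` are chosen only for illustration.

```r
# Example values taken from the median slides; NOT the 67-observation data.
y <- c(32, 5, 23, 99, 54)

# Mean "by hand": add up the values and divide by N.
N <- length(y)
mean_by_hand <- sum(y) / N

# R's built-in mean() should give the same answer.
mean_builtin <- mean(y)

print(mean_by_hand)
print(mean_builtin)
```

Both lines should print the same number, since `mean()` computes exactly the sum divided by N.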
Our Data
 Mean of Y is 260.67
 Mechanically…
 (263 + 73 + … + 88)/67=260.67
 Problems with the mean?
 No indication of dispersion or variability.
Variance
 The variance is a statistic that describes (squared) deviations around the mean:
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}$$
 Why “N-1”?
 Interpretation: “Average squared deviations from the mean.”
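The variance formula can be traced step by step in R. This is only a sketch on the five example values used later in the slides (32, 5, 23, 99, 54), not the course data; `squared_dev` and `var_by_hand` are illustrative names.

```r
# Example values from the median slides; NOT the 67-observation data.
y <- c(32, 5, 23, 99, 54)
N <- length(y)

# Step 1: squared deviations of each value from the mean.
squared_dev <- (y - mean(y))^2

# Step 2: sum them and divide by N - 1.
var_by_hand <- sum(squared_dev) / (N - 1)

# R's var() also divides by N - 1, so the two should match.
print(var_by_hand)
print(var(y))
```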
Our Data
 Variance= 202,431.8
 Mechanically:
 [(263-260.67)2 + (73-260.67)2 + ••• + (88-260.67)2 ]/66
 Interpretation:
 “The average squared deviation around Y is 202,431.
 Rrrrright. (Who thinks in terms of squared
deviations??)
 Answer: no one.
 That’s why we have a standard deviation.
Standard Deviation
 Take the square root of the variance and you get
the standard deviation.
 Why we like this:
 Metric is now in original units of Y.
$$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}}$$
 Interpretation
 S.D. gives “average deviation” around the mean.
 It’s a measure of dispersion that is in a metric that
makes sense to us.
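A short R sketch of the same idea: the standard deviation is just the square root of the variance. The first part uses the five example values from the median slides as stand-in data; the second part takes the variance reported in these slides for "Our Data" (202,431.8) and converts it. Variable names are illustrative.

```r
# Stand-in example values from the median slides; NOT the 67-observation data.
y <- c(32, 5, 23, 99, 54)

# Standard deviation "by hand": square root of the N - 1 variance.
sd_by_hand <- sqrt(sum((y - mean(y))^2) / (length(y) - 1))
print(sd_by_hand)
print(sd(y))  # built-in, should agree

# Using the variance reported in the slides for "Our Data".
sd_from_slides <- sqrt(202431.8)
print(round(sd_from_slides, 2))
```

The last line converts squared units back to the original units of Y, which is the whole reason for preferring the standard deviation.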
Our Data
 The standard deviation is: 449.92
 Mechanically:
{[(263-260.67)^2 + (73-260.67)^2 + ••• + (88-260.67)^2]/66}^(1/2)
 Interpretation: “The average deviation around the mean of 260.67 is 449.92.”
 Now, suppose Y=Votes…
 The average number of votes is “about 261 and the average deviation
around this number is about 450 votes.”
 The dispersion is very large.
 (Imagine the opposite case: mean test score is 85 percent; average
deviation is 5 percent.)
Revisiting our Data
[Figure: Histogram of Variable Y. x-axis: Variable Y (0 to 4000); y-axis: Frequency (0 to 40)]
Skewness and The Mean
 Data often exhibit skew.
 This is often true with political variables.
 We have a measure of central tendency and deviation about this measure (mean, s.d.)
 However, are there other indicators of central tendency?
 How about the median?
Median
 “50th” Percentile: Location at which 50 percent of the cases
lie above; 50 percent lie below.
 Since it’s a locational measure, you need to “locate it.”
 Example Data: 32, 5, 23, 99, 54
 As is, not informative.
Median
 Rank it: 5, 23, 32, 54, 99
 Median Location=(N+1)/2 (when n is odd)
 =6/2=3
 Location of the median is data point 3
 This is 32.
 Hence, M=32, not 3!!
 Interpretation: “50 percent of the data lie above 32; 50 percent of the
data lie below 32.”
 What would the mean be?
 (42.6…data are __________ skewed)
Median
 When n is even: -67, 5, 23, 32, 54, 99
 M is usually taken to be the average of the two middle
scores:
 (N+1)/2=7/2=3.5
 The median location is 3.5 which is between 23 and 32
 M=(23+32)/2=27.5
 All pretty straightforward stuff.
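The locate-then-read procedure for the median can be sketched in R using the two example data sets from the slides. The helper `median_by_location` is a hypothetical name written for this illustration; R's own `median()` is used as a cross-check.

```r
y_odd  <- c(32, 5, 23, 99, 54)        # n = 5 example from the slides
y_even <- c(-67, 5, 23, 32, 54, 99)   # n = 6 example from the slides

# Hypothetical helper: rank the data, find the location (n + 1) / 2,
# and average the values on either side of that location.
median_by_location <- function(y) {
  y <- sort(y)
  n <- length(y)
  loc <- (n + 1) / 2
  mean(y[c(floor(loc), ceiling(loc))])
}

m_odd  <- median_by_location(y_odd)
m_even <- median_by_location(y_even)
print(m_odd)
print(m_even)
print(mean(y_odd))  # compare the mean with the median for the odd-n example
```

When the location is a whole number, `floor` and `ceiling` pick the same data point, so the same function covers both the odd and even cases.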
Median Voter Theorem (a sidetrip)
 One of the most fundamental results in social sciences is
Duncan Black’s Median Voter Theorem (1948)
 Theorem predicts convergence to median position.
 Why do parties tend to drift toward the center?
 Why do firms locate in close proximity to one another?
 The theorem: “given single-peaked preferences, majority
voting, an odd number of decision makers, and a
unidimensional issue space, the position taken by the median
voter has an empty winset.”
 That is, under these general conditions, all we need to know is
the preference of the median chooser to determine what the
outcome will be. No position can beat the median.
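A small R sketch can make the "no position can beat the median" claim concrete. Everything here is an assumption for illustration: the five ideal points are made up, and single-peaked preferences are modeled in the simplest possible way (each voter prefers whichever position is closer to their ideal point).

```r
# Hypothetical ideal points on one issue dimension (odd number of voters).
ideal <- c(1, 3, 4, 7, 9)
m <- median(ideal)

# Assumed preference rule: a voter backs the median position only if it is
# strictly closer to their ideal point than the alternative is.
votes_for_median <- function(alt) sum(abs(ideal - m) < abs(ideal - alt))

# Pit the median against a grid of other positions under majority rule.
alternatives <- setdiff(seq(0, 10, by = 0.5), m)
median_wins <- sapply(alternatives,
                      function(a) votes_for_median(a) > length(ideal) / 2)

print(m)
print(all(median_wins))
```

Any alternative to the left of the median loses the median voter and everyone to the right; any alternative to the right loses the median voter and everyone to the left. Either way the median position keeps a majority.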
Dispersion around the Median
 The mean has its standard deviation…
 What about the median?
 No such thing as “standard deviation” per se, around the median.
 But, there is the IQR
 Interquartile Range
 The median is the 50th percentile.
 Suppose we compute the 25th and the 75th percentiles and then
take the difference.
 25th Percentile is the “median” of the lower half of the data; the 75th
Percentile is the “median” of the upper half.
IQR and the 5 Number Summary
 Data: -67, 5, 23, 32, 54, 99
 25th Percentile=5
 75th Percentile=54
 IQR is difference between 75th and 25th percentiles: 54-5=49
 Hence, M=27.5; IQR=49
 “Five Number Summary”: Min, 25th, 50th, 75th Percentiles, Max:
 -67, 5, 27.5, 54, 99
Finding Percentiles
 General Formula:
$$L = \frac{p \cdot n}{100}$$
 p is the desired percentile
 n is the sample size
 If L is a whole number:
 The value of the pth percentile is between the Lth value and the next value. Find the mean of those values.
 If L is not a whole number:
 Round L up. The value of the pth percentile is the Lth value.
Example
 -67, 5, 23, 32, 54, 99
 25th Percentile: L=(25*6)/100=1.5
 Round to 2. The 25th Percentile is 5.
 75th Percentile: L=(75*6)/100=4.5
 Round to 5. The 75th Percentile is 54.
 50th Percentile: L=(50*6)/100=3
 Take average of locations 3 and 4
 This is (23+32)/2=27.5.
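The percentile rule worked through above translates almost line for line into R. The function name `percentile_by_rule` is hypothetical; the cross-check against `quantile(..., type = 2)` rests on the assumption that type 2 implements the same "average at whole-number locations, otherwise round up" rule.

```r
# Hypothetical helper implementing the rule L = p * n / 100.
percentile_by_rule <- function(y, p) {
  y <- sort(y)
  n <- length(y)
  L <- p * n / 100
  if (isTRUE(all.equal(L, round(L)))) {
    # Whole number: average the Lth value and the next one
    # (indices are clamped so p = 0 and p = 100 stay in range).
    lo <- max(round(L), 1)
    hi <- min(round(L) + 1, n)
    mean(y[c(lo, hi)])
  } else {
    # Not a whole number: round L up and take that value.
    y[ceiling(L)]
  }
}

y <- c(-67, 5, 23, 32, 54, 99)  # example data from the slides
p25 <- percentile_by_rule(y, 25)
p50 <- percentile_by_rule(y, 50)
p75 <- percentile_by_rule(y, 75)
print(c(p25, p50, p75))

# Cross-check with the built-in quantile function.
print(quantile(y, c(0.25, 0.50, 0.75), type = 2))
```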
Our Data
 Median=120 Votes (i.e., L=[50*67]/100=33.5; round up to the 34th ranked value)
 25th Percentile=46 Votes
 75th Percentile=289 Votes
 IQR: 243 Votes
 5 number summary: Min=9, 25th P=46, Median=120, 75th P=289, Max=3407 (massive dispersion!)
 Mean was 260.67. Median=120.
 The Mean is much closer to the 75th percentile.
 That’s SKEW in action.
Revisiting our Data: Odd Ball Cases
[Figure: Histogram of Variable Y. x-axis: Variable Y (0 to 4000); y-axis: Frequency (0 to 40)]
“Influential Observations”
 Two data points:
 Y=(1013, 3407)
 Suppose we omit them (not recommended in applied
research)
 Mean plummets to 200.69 (drop of 60 votes)
 s.d. is cut by more than half: 203.92
 Med=114 (note, it hardly changed)
 Let’s look at a scatterplot
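The contrast between an influential mean and a resistant median can be sketched in R. The vector below is only an illustrative subset assembled from values quoted in these slides (it is not the full 67-observation data set), so its summary statistics will not match the slide numbers; the point is the direction and relative size of the change.

```r
# Illustrative subset of values quoted in the slides; NOT the full data set.
y_all <- c(9, 46, 73, 88, 120, 263, 289, 1013, 3407)

# Drop the two outlying points discussed on the slide.
y_trimmed <- y_all[!(y_all %in% c(1013, 3407))]

# How far does each measure of central tendency move?
mean_shift   <- mean(y_all) - mean(y_trimmed)
median_shift <- median(y_all) - median(y_trimmed)

print(c(mean_all = mean(y_all), mean_trimmed = mean(y_trimmed)))
print(c(median_all = median(y_all), median_trimmed = median(y_trimmed)))
print(c(mean_shift = mean_shift, median_shift = median_shift))
```

In this toy subset the mean moves by far more than the median does, which is the same pattern the slide reports for the real data.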
Useful to Visualize Data
[Figure: Scatterplot of Y (0 to 4000) against X (0 to 300000)]
Main Features?
 Y and X are positively related.
 There are clearly visible “outliers.”
 With respect to Y, which “outlier” worries you most?
[Figure: Scatterplot of Y (0 to 4000) against X (0 to 300000), annotated “Influence!”]
Simple Description
 You can learn a lot from just these simple indicators.
 Suppose that our Y was a real variable?
Palm Beach County, FL
2000 Election
Descriptive Statistics Help to Clarify
Some Issues.
 Palm Beach County
 Largely a Jewish community
 Heavily Democratic
 Yet an overwhelming number of Buchanan Votes
 The Ballot created massive confusion.
 Margin of Victory in Florida: 537 votes.
 Number of Buchanan Votes in PBC: 3407
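The summary statistics from earlier give a rough sense of how unusual the Palm Beach County figure is. This sketch assumes, as the slides imply but do not state outright, that the variable Y summarized earlier (mean 260.67, s.d. 449.92) is the county-level Buchanan vote; the variable names are illustrative.

```r
# Figures quoted in the slides.
mean_y <- 260.67   # mean of Y (assumed to be Buchanan votes by county)
sd_y   <- 449.92   # standard deviation of Y
pbc    <- 3407     # Buchanan votes in Palm Beach County
margin <- 537      # statewide margin of victory

# How many standard deviations above the mean is Palm Beach County?
z_pbc <- (pbc - mean_y) / sd_y
print(round(z_pbc, 2))

# How large is the PBC Buchanan vote relative to the statewide margin?
ratio_to_margin <- pbc / margin
print(round(ratio_to_margin, 2))
```

Note that the mean and standard deviation used here already include Palm Beach County, so if anything this understates how far it sits from the other counties.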
[Figure: Buchanan by Bush Vote in Florida. Scatterplot of Buchanan vote (y-axis, 0 to 4000) against Vote for Bush (x-axis, 0 to 300000), with each point labeled by county name. PALM BEACH is labeled near the top of the plot; PINELLAS, HILLSBOROUGH, BROWARD, and DADE are labeled toward the right.]
[Figure: Buchanan by Gore Vote. Scatterplot of Vote for Buchanan (y-axis, 0 to 4000) against Vote for Gore (x-axis, 0 to 400000), with each point labeled by county name. PALM BEACH is labeled near the top of the plot; DADE is labeled toward the right.]
Univariate Statistics
 We can clearly learn a lot from very simple statistics
 Some quick illustrations in R using data from last year’s
election (on Prop. 8)
Univariate Quantities in R
 Our Data
 Yes on Proposition 8 by County
 Graphical Displays of data
 Histogram
 Dot Chart
 Box Plots
 Stem and Leaf
 Strip Plot
First, the basic statistics in R
 Mean (by county):
 > mean(proportionforprop8)
 [1] 56.7202
 Standard deviation:
 > sd(proportionforprop8)
 [1] 13.39508
 Five-number summary:
 > fivenum(proportionforprop8)
 [1] 23.50787 46.93203 59.25364 68.03883 75.37070
Histogram
[Figure: Histogram of Yes on 8 by County. x-axis: Percentage Yes on Prop. 8 (20 to 80); y-axis: Frequency (0 to 15)]
Dot Chart
[Figure: Yes on 8 by County. Dot chart with one row per California county, listed from Tulare at the top to San Francisco at the bottom. x-axis: Percent Yes (0 to 100)]
Box Plot
[Figure: Box Plots for Prop. 8 by County. x-axis: California Counties; y-axis: Percent Yes on 8 (30 to 70). Source: Los Angeles Times]
Stem and Leaf
The decimal point is 1 digit(s) to the right of the |
2 | 459
3 | 4888
4 | 024445578
5 | 0113344566779
6 | 0000023344457889
7 | 0011123334455
Strip Plot
[Figure: Vote on Prop. 8 by County: Strip Chart. x-axis: Percentage Yes on Prop. 8 (30 to 70)]
R Code for Previous
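# Assumes the vectors proportionforprop8 and county are already loaded.
# County names used as dot chart labels (renamed so it does not mask base::row.names).
county_labels <- as.character(county)

# Histogram
hist(proportionforprop8, xlab="Percentage Yes on Prop. 8", ylab="Frequency",
     main="Histogram of Yes on 8 by County", col="yellow")

# Dot chart, with a vertical reference line at 50 percent
dotchart(proportionforprop8, labels=county_labels, cex=.7, xlim=c(0, 100),
         main="Yes on 8 by County", xlab="Percent Yes")
abline(v=50)
abline(h=16)

# Box plot, with a horizontal reference line at 50 percent
boxplot(proportionforprop8, col="lightblue", names=c("Proposition 8"),
        xlab="California Counties", ylab="Percent Yes on 8",
        main="Box Plots for Prop. 8 by County", sub="Source: Los Angeles Times")
abline(h=50)

# Stem-and-leaf display
stem(proportionforprop8)

# Strip chart
stripchart(proportionforprop8, method="stack", xlab="Percentage Yes on Prop. 8",
           main="Vote on Prop. 8 by County: Strip Chart", pch=1)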
Combined
[Figure: the four displays shown together in one panel: Box Plots for Prop. 8 by County; Histogram of Yes on 8 by County; Yes on 8 by County (dot chart); Vote on Prop. 8 by County: Strip Chart]
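The slides do not show how the combined display was produced. One common way to arrange several base R plots on a single page is `par(mfrow = ...)`; the sketch below assumes that approach and uses simulated stand-in percentages, since the real county data is not reproduced here.

```r
set.seed(1)
# Simulated stand-in for the 58 county percentages; NOT the real Prop. 8 data.
fake_pct <- runif(58, min = 25, max = 75)

# Arrange the next four plots in a 2 x 2 grid, saving the old settings.
old_par <- par(mfrow = c(2, 2))

boxplot(fake_pct, main = "Box plot", ylab = "Percent Yes")
hist(fake_pct, main = "Histogram", xlab = "Percent Yes")
dotchart(sort(fake_pct), main = "Dot chart", xlab = "Percent Yes",
         xlim = c(0, 100))
stripchart(fake_pct, method = "stack", main = "Strip chart",
           xlab = "Percent Yes", pch = 1)

# Restore the previous plotting layout.
par(old_par)
```

Swapping `fake_pct` for the real data vector, and restoring the titles and labels from the individual plot commands, should give a layout along the lines of the combined slide.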
Plots of Two Variables
[Figure: Prop. 8 Vote by Prop. 4 Vote. Scatterplot of Prop. 4 Vote (y-axis, 30 to 70) against Prop. 8 Vote (x-axis, 30 to 70)]
plot(proportionforprop8, proportionforprop4,
     xlab="Prop. 8 Vote", ylab="Prop. 4 Vote",
     main="Prop. 8 Vote by Prop. 4 Vote", col="red")
abline(h=50)
abline(v=50)