The Scientific Study of Politics (POL 51)
Professor B. Jones
University of California, Davis

Fun With Numbers
Some Univariate Statistics
Learning to Describe Data

Useful to Visualize Data
[Figure: Histogram. Horizontal axis: Variable Y (0 to 4000); vertical axis: Frequency (0 to 40).]

Main Features
Exhibits “Right Skew”
Some “Outlying” Data Points?
Question: Are the outlying data points also “influential” data points (on measures of central tendency)?
Let’s check…

The Mean
Formally, the mean is given by:
$$\bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_N}{N}$$
Or more compactly:
$$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}$$

Our Data
Mean of Y is 260.67
Mechanically… (263 + 73 + … + 88)/67 = 260.67
Problems with the mean? No indication of dispersion or variability.

Variance
The variance is a statistic that describes (squared) deviations around the mean:
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}$$
Why “N-1”?
Interpretation: “Average squared deviations from the mean.”

Our Data
Variance = 202,431.8
Mechanically: [(263-260.67)^2 + (73-260.67)^2 + … + (88-260.67)^2]/66
Interpretation: “The average squared deviation around $\bar{Y}$ is 202,431.”
Rrrrright. (Who thinks in terms of squared deviations??)
Answer: no one. That’s why we have a standard deviation.

Standard Deviation
Take the square root of the variance and you get the standard deviation.
Why we like this: Metric is now in original units of Y.
$$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}}$$

Interpretation
S.D. gives “average deviation” around the mean.
It’s a measure of dispersion that is in a metric that makes sense to us.

Our Data
The standard deviation is: 449.92
Mechanically: {[(263-260.67)^2 + (73-260.67)^2 + … + (88-260.67)^2]/66}^(1/2)
Interpretation: “The average deviation around the mean of 260.67 is 449.92.”
Now, suppose Y=Votes…
The average number of votes is “about 261 and the average deviation around this number is about 450 votes.”
The dispersion is very large.
(Imagine the opposite case: mean test score is 85 percent; average deviation is 5 percent.)
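The mean, variance, and standard deviation formulas above can be traced step by step in R. This is a minimal sketch on five illustrative values, not the 67-observation data set from the slides (which is only partly listed); the names y, y_bar, s2, and s are chosen here for illustration.

```r
# Illustrative values only; NOT the 67-observation data set from the slides
y <- c(32, 5, 23, 99, 54)
n <- length(y)

# Mean: sum of the observations divided by N
y_bar <- sum(y) / n

# Variance: sum of squared deviations from the mean, divided by N - 1
s2 <- sum((y - y_bar)^2) / (n - 1)

# Standard deviation: square root of the variance (back in the units of Y)
s <- sqrt(s2)

# Hand-computed values next to R's built-in mean(), var(), and sd()
print(rbind(by_hand  = c(mean = y_bar,   variance = s2,     sd = s),
            built_in = c(mean = mean(y), variance = var(y), sd = sd(y))))
```

R's var() and sd() also divide by N - 1, so the two rows should agree.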
Revisiting our Data
[Figure: Histogram. Horizontal axis: Variable Y (0 to 4000); vertical axis: Frequency (0 to 40).]

Skewness and The Mean
Data often exhibit skew.
This is often true with political variables.
We have a measure of central tendency and deviation about this measure (mean, s.d.).
However, are there other indicators of central tendency?
How about the median?

Median
“50th” Percentile: Location at which 50 percent of the cases lie above; 50 percent lie below.
Since it’s a locational measure, you need to “locate it.”
Example Data: 32, 5, 23, 99, 54
As is, not informative.

Median
Rank it: 5, 23, 32, 54, 99
Median Location = (N+1)/2 (when N is odd) = 6/2 = 3
Location of the median is data point 3.
This is 32. Hence, M=32, not 3!!
Interpretation: “50 percent of the data lie above 32; 50 percent of the data lie below 32.”
What would the mean be? (42.6… data are __________ skewed)

Median
When N is even: -67, 5, 23, 32, 54, 99
M is usually taken to be the average of the two middle scores:
(N+1)/2 = 7/2 = 3.5
The median location is 3.5, which is between the 3rd and 4th ranked values (23 and 32).
M = (23+32)/2 = 27.5
All pretty straightforward stuff.

Median Voter Theorem (a sidetrip)
One of the most fundamental results in social sciences is Duncan Black’s Median Voter Theorem (1948).
Theorem predicts convergence to median position.
Why do parties tend to drift toward the center?
Why do firms locate in close proximity to one another?
The theorem: “given single-peaked preferences, majority voting, an odd number of decision makers, and a unidimensional issue space, the position taken by the median voter has an empty winset.”
That is, under these general conditions, all we need to know is the preference of the median chooser to determine what the outcome will be.
No position can beat the median.

Dispersion around the Median
The mean has its standard deviation…
What about the median?
No such thing as “standard deviation” per se, around the median.
But, there is the IQR.

Interquartile Range
The median is the 50th percentile.
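The median location rule from the two worked examples, (N+1)/2 on the ranked data, can be sketched in R as follows. The function name median_by_location is made up for this illustration; R's built-in median() is shown alongside as a check.

```r
# Median via the location rule (N + 1) / 2 applied to the ranked data.
# median_by_location is a name invented for this sketch.
median_by_location <- function(y) {
  y_sorted <- sort(y)            # "rank it"
  n <- length(y_sorted)
  loc <- (n + 1) / 2             # median location
  if (n %% 2 == 1) {
    # N odd: the location is a whole number, take that data point
    y_sorted[loc]
  } else {
    # N even: the location falls between two points, average them
    mean(y_sorted[c(floor(loc), ceiling(loc))])
  }
}

odd_example  <- c(32, 5, 23, 99, 54)        # example data from the slides
even_example <- c(-67, 5, 23, 32, 54, 99)   # example data from the slides

m_odd  <- median_by_location(odd_example)   # 32
m_even <- median_by_location(even_example)  # 27.5

# Location-rule results next to the built-in median()
print(c(m_odd = m_odd, builtin_odd = median(odd_example),
        m_even = m_even, builtin_even = median(even_example)))
```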
Suppose we compute the 25th and the 75th percentiles and then take the difference.
The 25th percentile is the “median” of the lower half of the data; the 75th percentile is the “median” of the upper half.

IQR and the 5 Number Summary
Data: -67, 5, 23, 32, 54, 99
25th Percentile=5
75th Percentile=54
IQR is the difference between the 75th and 25th percentiles: 54-5=49
Hence, M=27.5; IQR=49
“Five Number Summary”
Min, 25th, 50th, 75th Percentiles, Max: -67, 5, 27.5, 54, 99

Finding Percentiles
General Formula:
$$L = \frac{pn}{100}$$
p is the desired percentile
n is the sample size
If L is a whole number: the value of the pth percentile is between the Lth value and the next value. Find the mean of those values.
If L is not a whole number: round L up. The value of the pth percentile is the Lth value.

Example
-67, 5, 23, 32, 54, 99
25th Percentile: L=(25*6)/100=1.5. Round up to 2. The 25th percentile is 5.
75th Percentile: L=(75*6)/100=4.5. Round up to 5. The 75th percentile is 54.
50th Percentile: L=(50*6)/100=3. Take the average of locations 3 and 4. This is (23+32)/2=27.5.

Our Data
Median=120 Votes (i.e. L=[50*67]/100=33.5, rounded up to location 34)
25th Percentile=46 Votes
75th Percentile=289 Votes
IQR: 243 Votes
5 number summary: Min=9, 25th P=46, Median=120, 75th P=289, Max=3407 (massive dispersion!)
Mean was 260.67. Median=120.
The mean is much closer to the 75th percentile. That’s SKEW in action.

Revisiting our Data: Odd Ball Cases
[Figure: Histogram. Horizontal axis: Variable Y (0 to 4000); vertical axis: Frequency (0 to 40).]

“Influential Observations”
Two data points: Y=(1013, 3407)
Suppose we omit them (not recommended in applied research)
Mean plummets to 200.69 (drop of 60 votes)
s.d. is cut by more than half: 203.92
Med=114 (note, it hardly changed)
Let’s look at a scatterplot

Useful to Visualize Data
[Figure: Scatterplot. Horizontal axis: X (0 to 300000); vertical axis: Y (0 to 4000).]

Main Features?
Y and X are positively related.
There are clearly visible “outliers.”
With respect to Y, which “outlier” worries you most?

Influence!
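Two ideas from the preceding slides lend themselves to a short R sketch: the percentile location rule L = pn/100, and the way a single extreme value pulls the mean and standard deviation while leaving the median almost unchanged. The function name percentile_by_rule and the vector votes_demo are invented for this illustration; votes_demo is hypothetical and is not the county data from the slides.

```r
# Percentile location rule from the slides: L = p * n / 100.
# Intended for 0 < p < 100 (p = 100 would index past the last value).
# percentile_by_rule is a name invented for this sketch.
percentile_by_rule <- function(y, p) {
  y_sorted <- sort(y)
  n <- length(y_sorted)
  L <- p * n / 100
  if (L == floor(L)) {
    # L is a whole number: average the Lth value and the next value
    mean(y_sorted[c(L, L + 1)])
  } else {
    # L is not a whole number: round L up and take that value
    y_sorted[ceiling(L)]
  }
}

y <- c(-67, 5, 23, 32, 54, 99)    # example data from the slides

q25 <- percentile_by_rule(y, 25)  # 5
q50 <- percentile_by_rule(y, 50)  # 27.5
q75 <- percentile_by_rule(y, 75)  # 54
iqr <- q75 - q25                  # 49
five_number <- c(min = min(y), p25 = q25, median = q50, p75 = q75, max = max(y))
print(five_number)

# Influence: HYPOTHETICAL values with one extreme observation
votes_demo <- c(40, 55, 70, 90, 120, 150, 2000)
trimmed <- votes_demo[votes_demo < 1000]   # same data without the extreme value

influence <- rbind(
  all_points      = c(mean = mean(votes_demo), sd = sd(votes_demo),
                      median = median(votes_demo)),
  outlier_dropped = c(mean = mean(trimmed), sd = sd(trimmed),
                      median = median(trimmed))
)
print(influence)
```

R's built-in quantile() and IQR() use a different interpolation rule by default, so they need not reproduce the slide values; quantile(y, probs, type = 2) should correspond to the rule used here.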
[Figure: Scatterplot of Y against X. Horizontal axis: X (0 to 300000).]

Simple Description
You can learn a lot from just these simple indicators.
Suppose that our Y was a real variable?

Palm Beach County, FL 2000 Election
Descriptive Statistics Help to Clarify Some Issues.

Palm Beach County
Largely a Jewish community
Heavily Democratic
Yet an overwhelming number of Buchanan Votes
The Ballot created massive confusion.
Margin of Victory in Florida: 537 votes.
Number of Buchanan Votes in PBC: 3407

[Figure: Buchanan by Bush Vote in Florida. Scatterplot with each point labeled by county name (Palm Beach, Pinellas, Hillsborough, Broward, Dade, and the remaining Florida counties). Horizontal axis: Vote for Bush (0 to 300000); vertical axis ticks run from 0 to 4000.]

[Figure: Buchanan by Gore Vote. Scatterplot with each point labeled by county name. Horizontal axis: Vote for Gore (0 to 400000); vertical axis: Vote for Buchanan (0 to 4000).]

Univariate Statistics
We can clearly learn a lot from very simple statistics
Some quick illustrations in R using data from last year’s election (on Prop.
8)

Univariate Quantities in R
Our Data: Yes on Proposition 8 by County
Graphical Displays of data:
Histogram
Dot Chart
Box Plots
Stem and Leaf
Strip Plot

First, the basic statistics in R
Mean (by county):
  > mean(proportionforprop8)
  [1] 56.7202
Standard deviation:
  > sd(proportionforprop8)
  [1] 13.39508
Five-number summary:
  > fivenum(proportionforprop8)
  [1] 23.50787 46.93203 59.25364 68.03883 75.37070

Histogram
[Figure: Histogram of Yes on 8 by County. Horizontal axis: Percentage Yes on Prop. 8 (20 to 80); vertical axis: Frequency (0 to 15).]

Dot Chart
[Figure: Yes on 8 by County. Dot chart with one row per California county, listed from Tulare to San Francisco. Horizontal axis: Percent Yes (0 to 100).]

Box Plot
[Figure: Box Plots for Prop. 8 by County. Vertical axis: Percent Yes on 8 (30 to 70); horizontal axis label: California Counties. Source: Los Angeles Times.]

Stem and Leaf
  The decimal point is 1 digit(s) to the right of the |

  2 | 459
  3 | 4888
  4 | 024445578
  5 | 0113344566779
  6 | 0000023344457889
  7 | 0011123334455

Strip Plot
[Figure: Vote on Prop. 8 by County: Strip Chart. Horizontal axis: Percentage Yes on Prop. 8 (30 to 70).]

R Code for Previous
  # county and proportionforprop8 are not defined on these slides;
  # they are assumed to be loaded already
  row.names <- cbind(county)   # county names, used as dot chart labels

  # Histogram
  hist(proportionforprop8, xlab="Percentage Yes on Prop. 8", ylab="Frequency",
       main="Histogram of Yes on 8 by County", col="yellow")

  # Dot chart, with a vertical reference line at 50 percent
  dotchart(proportionforprop8, labels=row.names, cex=.7, xlim=c(0, 100),
           main="Yes on 8 by County", xlab="Percent Yes")
  abline(v=50)
  abline(h=16)

  # Box plot, with a horizontal reference line at 50 percent
  boxplot(proportionforprop8, col="light blue", names=c("Proposition 8"),
          xlab="California Counties", ylab="Percent Yes on 8",
          main="Box Plots for Prop. 8 by County", sub="Source: Los Angeles Times")
  abline(h=50)

  # Stem-and-leaf display
  stem(proportionforprop8)

  # Strip chart with stacked points
  stripchart(proportionforprop8, method="stack", xlab="Percentage Yes on Prop. 8",
             main="Vote on Prop. 8 by County: Strip Chart", pch=1)

Combined
[Figure: four panels in one display: Histogram of Yes on 8 by County; Box Plots for Prop. 8 by County (Source: Los Angeles Times); Yes on 8 by County (dot chart); Vote on Prop. 8 by County: Strip Chart.]

Plots of Two Variables
  # proportionforprop4 is likewise assumed to be loaded already
  plot(proportionforprop8, proportionforprop4, xlab="Prop. 8 Vote", ylab="Prop. 4 Vote",
       main="Prop. 8 Vote by Prop. 4 Vote", col="red")
  abline(h=50)
  abline(v=50)
[Figure: Prop. 8 Vote by Prop. 4 Vote. Scatterplot. Horizontal axis: Prop. 8 Vote (30 to 70); vertical axis: Prop. 4 Vote (30 to 70).]
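The slides show a combined four-panel display but give no code for it. One way such a layout could be produced in base R is with par(mfrow = c(2, 2)), sketched below. The vectors county_demo and yes8_demo are hypothetical stand-ins generated for this example; they are not the slides' county and proportionforprop8 data.

```r
# HYPOTHETICAL stand-in data; not the Prop. 8 county data from the slides
set.seed(51)
county_demo <- paste("County", LETTERS[1:12])
yes8_demo <- round(runif(12, min = 25, max = 75), 1)

# Split the plotting region into a 2 x 2 grid (filled row by row),
# keeping the previous settings so they can be restored afterwards
old_par <- par(mfrow = c(2, 2))

hist(yes8_demo, xlab = "Percentage Yes", ylab = "Frequency",
     main = "Histogram", col = "yellow")

boxplot(yes8_demo, col = "light blue", ylab = "Percent Yes",
        main = "Box Plot")
abline(h = 50)   # 50 percent reference line

dotchart(yes8_demo, labels = county_demo, cex = 0.7, xlim = c(0, 100),
         main = "Dot Chart", xlab = "Percent Yes")
abline(v = 50)   # 50 percent reference line

stripchart(yes8_demo, method = "stack", xlab = "Percentage Yes",
           main = "Strip Chart", pch = 1)

# Restore the single-panel layout
par(old_par)
```

The panel order here (histogram, box plot, dot chart, strip chart) is a guess at the arrangement on the Combined slide; reordering the four calls changes which panel each plot lands in.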