Announcement
Homework #2 due next Friday at 5pm.
Midterm is in 2 weeks. It will cover everything through the end of next week (week 5).

Political Science 15
Lecture 8: Descriptive Statistics (Part 1)

Data Coding
Coding is the process of assigning numerical values to the values of your variable.
The meaning of these codes will depend on the level of measurement of the variable:
Nominal: codes are just indications of the category
Ordinal: codes are indications of ordering
Interval/Ratio: codes are the actual numerical value

Preparing Data for Hypothesis Testing
Gather measurements on all of the concepts important for your hypothesis (dependent, independent, and control variables).
Enter them into a spreadsheet. We will use SPSS in this class.
Each row is an observation (unit), each column is a variable.

Example of Data Ready for Hypothesis Testing
Interview #   Religion   Income   Ideology
1             1          35000    4
2             1          46000    3
3             3          82000    5
4             2          19000    2
5             1          67000    6
We use a codebook to find out what these numbers mean.

Descriptive Statistics
Descriptive statistics can be used for descriptive inference – using data to learn something about the state of the world.
These descriptive statistics will also be the building blocks we use for causal inference – testing our hypotheses with data to learn something about how the world works.
We begin with descriptive statistics for a single variable.

Understanding Our Data
Before undertaking any data analysis you should examine your data carefully.
Watch for unusual distributions of variable values and outliers in the data.
An outlier is an extreme value on a variable. Try to determine why you have observed this value. An unusual case? A coding error?
[Figure: example of an outlier affecting a relationship]

Exploring Data: Frequency Distributions
Divide the variable into a set of exhaustive, mutually exclusive categories.
Example:
Ideology       # of people   Percent   Cumulative Percent
Conservative   300           30%       30%
Moderate       500           50%       80%
Liberal        200           20%       100%
Total          1000          100%      100%

Exploring Data: Graphical Methods
For nominal and ordinal level data, bar graphs work well.
[Figure: bar graph]

Exploring Data: Graphical Methods
For interval level data, a histogram is useful (note detection of outlier).
[Figure: histogram]

Central Tendency: Mode
The mode is the category of a variable with the greatest frequency of observations.
The mode is most commonly used on variables with a nominal level of measurement.
There can be more than one modal value for a variable. Variables with more than one mode are referred to as bimodal or multimodal.
Example: In a party ID variable we have 40 Democrats, 60 Republicans, and 20 Independents, so the mode is “Republican.”

Central Tendency: Median
The median is the value of a variable that divides the observations on that variable in half.
If we ordered our observations on a variable from lowest to highest, the median observation is the one in the middle.
With an even number of observations there is no true median, since no single observation sits in the middle.
The median is most commonly used on variables with an ordinal level of measurement, but is sometimes used on interval/ratio data because it is resistant to outliers.

Example of Calculating Median
We have a 7-point scale on ideology in a survey:
Category:      1    2    3    4     5    6    7
# responses:   32   54   97   103   44   21   12
There are N = 363 responses, so the median observation is observation (N+1)/2 = 182.
Count up from the lowest value: the cumulative counts are 32, 86, 183, so observation 182 falls in category 3. The median is 3.

Quartiles
If we arrange a variable from lowest to highest value, the median is the observation at the 50% mark.
Quartiles are at the 25%, 50% and 75% marks.
Quintiles: 20%, 40%, 60%, 80%
Deciles: every 10%
Percentiles: every 1%
We can use these to get a more detailed picture of the distribution of a variable.

Central Tendency: Mean
The mean is the sum of the values of a variable divided by the number of observations on that variable.
This is usually what people mean by “average.”
The formula for the mean is written as: mean = ΣX / N
The mean is most commonly used on variables with an interval level of measurement.

Example of Calculating Mean
We have campaign spending in 7 districts:
District:   1      2      3      4      5   6     7
$ spent:    1000   5000   3500   2000   0   800   6000
ΣX = 1000 + 5000 + 3500 + 2000 + 0 + 800 + 6000 = 18300.
N = 7
The mean is 18300/7 = 2614.

[Figure: central tendencies in global income distribution]

Dispersion: Standard Deviation
The variance of a variable is the sum of the squared differences between each value of that variable and the mean, divided by N – 1.
We square the differences so that positive and negative differences don’t cancel out.
We divide by N – 1 to get a (conservative) estimate of the mean dispersion of the variable.
The square root of the variance is the standard deviation: s = square root of [Σ(X – mean)² / (N – 1)]

Example of Calculating Standard Deviation
We have campaign spending in 7 districts:
District:   1      2      3      4      5   6     7
$ spent:    1000   5000   3500   2000   0   800   6000
Mean of variable is 2614.
s = square root of [1/6 ((1000 – 2614)² + (5000 – 2614)² + …)]
The standard deviation is 2275. (Dividing by N = 7 instead of N – 1 = 6 would give 2106.)

z scores
A z score is a measure of how many standard deviations a particular observation is above or below the mean.
We subtract the mean from the observation and divide by the standard deviation.

Example of Calculating z scores
We have campaign spending in 7 districts:
District:   1      2      3      4      5   6     7
$ spent:    1000   5000   3500   2000   0   800   6000
Mean of variable is 2614.
Standard deviation of variable is 2275.
z score for district 1 is (1000 – 2614)/2275 = -0.71
z score for district 2 is (5000 – 2614)/2275 = 1.05

Descriptive Statistics for Relationships Between Variables
These are the more interesting descriptive statistics from our perspective, since we are interested in testing causal relationships between variables.
Our hypothesis tests later in the class will usually be based on these calculations.
As with a single variable, we begin by exploring our data to be sure we understand it.

Exploring Data: Bivariate Frequency Distributions
Divide the variables into a set of exhaustive, mutually exclusive categories.
Example:
Party ID     Favors Gas Tax   Opposes Gas Tax
Democrat     50% (N=500)      10% (N=100)
Republican   10% (N=100)      30% (N=300)

Examples of Relationships in Crosstabs
Our hypothesis is that Democrats are more supportive of a gas tax. Do our data support this?

      Dem   Rep
Yes   25%   25%
No    25%   25%
Support for the hypothesis: No

      Dem   Rep
Yes   40%   10%
No    10%   40%
Support for the hypothesis: Yes

      Dem   Rep
Yes   10%   40%
No    40%   10%
Support for the hypothesis: No

Exploring Data: Graphical Methods
For interval level data, scatterplots are a good way to examine relationships between variables.
[Figure: scatterplot]

Correlations
Correlations measure the relationship between two interval level variables.
Correlations always fall between –1 and 1.
Positive correlations indicate a positive relationship, negative correlations indicate a negative relationship.
No relationship gives a 0 correlation, but 0 correlation does not necessarily mean no relationship.
Correlations only capture linear relationships: y = a + b*x

[Figure: positive correlations, stronger and weaker]
[Figure: negative correlations, stronger and weaker]
[Figure: examples of correlations]
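
The slides do not give a formula for the correlation. As a minimal sketch, assuming the lecture means the standard Pearson correlation coefficient, it can be built from the z scores defined earlier: convert both variables to z scores, multiply them pairwise, and divide the sum by N – 1. The function names and the small data sets below are illustrative and do not come from the lecture.

```python
import math


def mean(values):
    """Sum of the values divided by the number of observations."""
    return sum(values) / len(values)


def sample_sd(values):
    """Standard deviation using the N - 1 divisor, as in the lecture."""
    m = mean(values)
    squared_diffs = sum((v - m) ** 2 for v in values)
    return math.sqrt(squared_diffs / (len(values) - 1))


def z_scores(values):
    """How many standard deviations each observation is from the mean."""
    m = mean(values)
    s = sample_sd(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Pearson correlation: sum of paired z-score products over N - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same number of observations")
    zx = z_scores(xs)
    zy = z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical data, chosen only to illustrate the three cases.
x = [-2, -1, 0, 1, 2]
linear_up = [2 + 3 * v for v in x]       # y = a + b*x with b > 0
linear_down = [10 - 0.5 * v for v in x]  # y = a + b*x with b < 0
curved = [v ** 2 for v in x]             # strong but non-linear relationship

r_up = correlation(x, linear_up)
r_down = correlation(x, linear_down)
r_curved = correlation(x, curved)

print(round(r_up, 3), round(r_down, 3), round(abs(r_curved), 3))
```

The two linear examples give correlations at the extremes of the –1 to 1 range. The curved example gives a correlation of essentially zero even though y is completely determined by x, which illustrates why a 0 correlation does not necessarily mean no relationship.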
