PSC 211 Midterm Study Guide
SCROLL DOWN TO PAGE 2

(note: I post this edited syllabus with all due humility, noting that it is not perfect and reflects my approaches to the problems we might have to do for the test, not necessarily yours.)

{Original Material}
You should be able to define and give the significance of the following. Where applicable, you should know the symbol that represents the concept and be familiar with the formula for deriving the statistic.

Closed-book section of the exam:
- Inductive reasoning
- Deductive reasoning
- Subjective vs. objective reasoning
- Descriptive statistics
- Inferential statistics
- Randomness
- Data
- Case
- Unit of analysis
- Ecological fallacy
- Sample
- Population
- Nominal, ordinal, interval, and ratio level variables
- Discrete versus continuous variables
- Frequency distributions
- Percentage distributions
- Cumulative distributions
- Unimodal distribution
- Sum of squares
- Variance
- Standard deviation
- Bimodal distribution
- Mean
- Median
- Mode
- Skewness
- N
- Normal distribution
- z-scores
- Sampling distribution
- Distribution of a sample
- Population distribution
- Standard error
- Standardized variables
- 95% confidence interval
- 99% confidence interval
- Positive relationship
- Negative relationship
- Curvilinear relationship
- Central limit theorem
- Proportion

Open Book Section of Exam
There will be an open book, open notes section of the exam. In this portion, you will need to:
- Calculate the mean, median, and mode.
- Calculate frequency distributions, percentage distributions and cumulative percentage distributions.
- Calculate the variance, standard deviation, z-scores, standard error, and confidence intervals for data sets.
- Draw pie charts and bar graphs to describe data.

NOTE: Be certain that you can provide a substantive interpretation of the statistics. In other words, how would you explain the significance of a statistic to your grandmother? The ability to provide a meaningful translation is essential to understanding and conducting political science research.
{End original material}

FOR THE EXAM: Bring a calculator, pencils, and erasers. I will supply all paper and a stapler.

Closed-book section:
- inductive reasoning: reasoning from detailed facts to general principles
  o for example, using a set of observations to make generalizations about data, like “people with pets live longer.” It cannot be proved absolutely that owning a pet makes you live longer. Statistical inference is a type of inductive reasoning; i.e. assuming that something is true of a population because it is true of a representative sample. Accurate as long as the sample size is large enough.
- deductive reasoning: reasoning from the general to the particular
  o for example, the classic syllogism:
    All men are mortal
    Socrates is a man
    Therefore, Socrates is mortal
- subjective v. objective reasoning:
- descriptive statistics: methods for summarizing information so that it is more intelligible, more useful or can be communicated more effectively
  o i.e. calculating averages, graphing techniques (baseball stats)
  o can understand what the data actually means
- inferential statistics: procedures used to generalize from a sample to the larger population and assess the confidence we have in such generalizing
  o for example, opinion polls using representative samples, with a margin of error
  o 95/99% confidence intervals
  o relevant b/c it helps us understand things about large groups that we wouldn’t be able to completely measure
- randomness: we may see patterns in the world and society that aren’t there, but also not notice patterns when they exist.
  Statistics can help see through the seeming randomness of phenomena.
- data: the unsummarized records of observations that statistics makes more manageable
  o ex: what happened at bat every time
- unit of analysis: the person, object or event that a researcher is studying
  o ex: individuals, groups, editorials, elections
- case: the specific unit from which data are collected
  o ex: the person being interviewed, college students
- ecological fallacy: the logical error of inferring characteristics of individuals from aggregate data
  o ex: since people who have dogs tend to live longer, and I own a dog, I will live longer
- aggregate data: data in which the cases are larger units of analysis
- sample: a part of the population that, when chosen randomly, can with degrees of confidence be generalized to the population
  o statistic: a characteristic of a sample
- population: all or almost all cases to which a researcher wants to generalize
  o ex: research on Kentucky: population = all the people living in KY
  o oftentimes too large, expensive, time-consuming or rapidly-changing to collect all data from the population
  o parameter: a characteristic of a population
- averages:
  o mode: (or Mo) the most frequently occurring score on a variable (for example: female is the modal gender in the US)
    unimodal distribution: distribution in which one score occurs considerably more often than other scores (i.e.
    it has only one mode); there will be only one “hump”
    bimodal distribution: bar graph or histogram shows two scores that are obviously the most common (it has 2 modes); may resemble a camel’s two humps
  o median: (or Md) the value that divides an ordered set of scores in half
    calculating the median for:
    an odd number of scores: put the scores in order from lowest to highest, then find the middle score
    an even number of scores: put scores in order from lowest to highest, find the two middle scores, average these two scores by adding them and dividing by 2
  o mean: (or $\bar{X}$) the arithmetical average found by dividing the sum of all scores by the number of scores (or N):
    $\bar{X} = \frac{\sum X_i}{N}$

Levels of measurement:
1. nominal variable: measured such that its attributes are different, but not based on some underlying continuum (like high to low)
   a. ex: male and female; red, white and blue
2. dichotomous variable: has exactly two values
   a. ex: yes and no, male and female
   b. nominal and ordinal variables can be dichotomous
3. ordinal variable: one whose values can be rank-ordered, but nothing else
   a. ex: none, a little, a lot, always / social class
4. interval variable: has values that can be rank-ordered, using a standard unit of measurement (ex: dollars, pounds, inches)
5. ratio variable: like an interval variable, but has a non-arbitrary zero point representing the absence of the characteristic being measured (ex: number of years in school, # of hours spent watching TV, temperature in degrees Kelvin)
6.
interval-ratio variables: since there aren’t many interval variables in the social sciences, and interval and ratio variables can usually be handled the same way, the two are grouped together here

- continuous variable: can take on any value in a range of possible values (ex: age measured to the second, attitudinal variables)
- discrete variable: can have only certain values within its range (ex: family size = 1, 2, …)
  o nominal is always discrete, but interval-ratio can be discrete or continuous
- normal distribution: typically looks like a symmetric bell-shaped curve; a given number of standard deviations from the mean will always “cut off” a certain percentage of scores. In a normal distribution about 68%, 95% and 99.7% of scores lie within 1, 2 and 3 standard deviations of the mean, respectively.
- frequency distribution: summarizing data by counting the number of cases with each score
- percentage distribution: standardizes summaries to some degree; makes them easier to understand, especially with large numbers of observations
- cumulative distributions: tell us things like what % of respondents are greater than or less than x; useful only for ordinal or interval-ratio variables
  o cumulative percentage: the percentage of all scores that have a given value or less, calculated as $\frac{F}{N}(100)$
  o cumulative frequency: the sum of all frequencies of a given or lesser value
- skewness: the extent to which an (asymmetric) distribution of a variable has more cases in one direction than another (negatively skewed or positively skewed)
- sum of squares: the sum of squared deviations from the mean; calculated as $\sum (X_i - \bar{X})^2$ (in other words, subtract the mean from each score, square each difference and add all these together)
- N: the number of scores.
  For example, when calculating the mean, you divide the sum of all scores by the number of scores.
- measures of variation: summarize how close together or spread out scores are
  o variance: (represented as $s^2$) the average squared deviation from the mean; calculated as $s^2 = \frac{\sum (X_i - \bar{X})^2}{N - 1}$ for a sample, and the same for a population except with N in place of N - 1, where N is the number of cases, $\bar{X}$ is the mean and $X_i$ is a score
  o standard deviation: (represented as s for a sample, or σ for a population) roughly the typical deviation of scores from the mean; calculated as the square root of the variance: $s = \sqrt{s^2}$
  o z-score (standard score): the number of standard deviations that a score is from the mean; gives us a standard measure that can be used to compare scores from distributions with different means and standard deviations:
    $Z_i = \frac{X_i - \bar{X}}{s}$
- population distribution: the distribution of scores in a population
- distribution of a sample: the distribution of scores in a sample of a given size
- sampling distribution: the distribution of some statistic (e.g. the mean) in all possible samples of a given size; we can’t actually draw all possible samples of a given size from large populations, but statisticians have figured out what the sampling distribution is for certain important statistics. (see next)
- central limit theorem: In a random sample, as the sample size N increases, the sampling distribution of the mean more and more closely resembles a normal distribution with a mean equal to the population mean and a standard deviation of $\frac{\sigma}{\sqrt{N}}$. We can find the proportion of “cases” (i.e. sample means) that lie a given number of standard deviations from the mean of the sampling distribution
- standard error: (expressed as $\sigma_{\bar{X}}$) the standard deviation of a sampling distribution:
  $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}}$
- confidence intervals: our best estimate of μ (population mean) is $\bar{X}$ (sample mean), but random samples can vary. (Population mean can be more or less than sample mean.) Confidence intervals are a range around $\bar{X}$ in which μ probably lies (i.e.
  we are confident that if μ isn’t $\bar{X}$, then it’s at least within this range).
  o 95 percent confidence interval: based on 95% of scores in a normal distribution lying within 1.96 standard deviations of the mean.
    95% C.I. = $\bar{X} \pm 1.96\,\sigma_{\bar{X}}$
    or in other words, complete the formula by adding 1.96 times the standard error to the mean to find the upper limit, and then subtracting it from the mean to find the lower limit. The interval will be some number x through some number y
  o 99 percent confidence interval: based on 99% of scores in a normal distribution lying within 2.58 standard deviations of the mean.
    99% C.I. = $\bar{X} \pm 2.58\,\sigma_{\bar{X}}$
    same as the last one, but substitute 2.58 in there
- proportion: a proportion is the number you get before you multiply by 100 to get a percentage; so on p.32-33 for example, the proportion of cases that don’t recycle much is 0.42. 42 percent do not recycle much. It’s just a different way of expressing it.
- positive relationship: when doing a bivariate analysis, a positive relationship is when higher scores on one variable are associated with higher scores on the other variable (ex: “as x increases, y increases”)
- negative relationship: higher scores on one variable are associated with lower scores on the other variable (ex: “as x increases, y decreases”)
- curvilinear relationship: a relationship that starts positive and turns negative, or starts negative and turns positive. For example, in class we mentioned the effect of African American population on welfare benefits. When it is low, benefits are high. Increasing this number decreases the benefits, up to a point at which African Americans become a significant enough portion of the population to direct policy, and benefits start to rise again.

Open-book section:
(note: I wouldn’t be surprised if I made an error with these numbers somewhere. Brani pointed out one, which I corrected.
As long as you don’t screw up the arithmetic, the procedures should be sound, though.)

- calculate the mean, median and mode: let’s say we have 5 scores: 12, 7, 18, 44, and 26
  o mean ($\bar{X}$ for a sample, or μ for a population): add all the scores together (12+7+18+44+26=107) to get the sum of all scores ($\sum X_i$) and divide that by the total number of scores (N, or 5): 107/5 = 21.4
    $\bar{X} = \frac{\sum X_i}{N}$ (population mean = μ)
  o median: (or Md) we put these scores in order (7, 12, 18, 26, 44) and pick the score in the middle, so 18 is the median here. If there were an even number of scores, we would take the middle two scores, add them together and divide by 2 (i.e. find their mean). For example, if the middle two scores were 18 and 26: 18+26=44; 44/2=22, so 22 would be the median.
    a good example of a median is on the capstone website (exploringhuntington.com); the median household income is about $23,000, which means half the households in Huntington earn less than that.
  o mode: (or Mo) there is no mode here, because each score occurs only once. So we say it has no mode. However, if there were two 12’s or maybe two 7’s, then the number that occurred twice (by definition more than any other) would be the mode.
- calculate a frequency distribution: this is just adding up how many cases have each score. For example, looking at the table below, with f representing the frequency with which each score occurs, we can deduce that 4 people answered “graduate,” 9 answered B.A., etc. when asked about the highest education level they had received. We can also use this data to create percentage distributions and cumulative percentage distributions (see below)

The frequency distribution table we did in class on 2/21/05:
Civil Disobedience by Education (in frequencies)
Title= [dependent variable] by [independent variable] (in [freq.
or percent.])
Independent variable across the top (ascending: low to high); dependent variable down the side (descending order)

               < High School   High School   College   Total
Conscience           4              13          12       29
Obey Law             6              11           4       21
Total               10              24          16       50

- calculate a percentage distribution: you’re standardizing the summary distribution by calculating what each frequency would be if there were a total of exactly 100 cases (“percentaging”). Divide each frequency by the total number of cases, then multiply each result by 100. You will have a set of percentages to put into a table now.
  Percent = $\frac{f}{N}(100)$
  where f is the number (frequency) of scores that have a given value, and N is the total number of cases
  Let’s apply this to the table we worked on in class, specifically what those who had completed college answered to the question. 12 (or f) divided by 16 total college cases (or N) is 0.75; multiplied by 100 that is 75. So we can say that ¾ (or 75%) of College graduates answered “Conscience,” suggesting that they would be willing to disobey a law they thought was unjust. Since there is only one other value, we don’t have to do any more math on this question; ¼ of college respondents here think you should obey the law no matter what. Note that you can circle the highest scores in the < High School, High School and College categories to see what most answered in each one, and with our table this demonstrates a positive relationship between education and civil disobedience.
- calculate a cumulative [percentage] distribution: a cumulative percentage is the % of all scores with a given value or less, so first you add all the frequencies for the given value and also the ones for all lesser values (this value is F). Then divide F by the total number of cases (or N). Multiply the result by 100.
  Percent = $\frac{F}{N}(100)$
  where F is the sum of the frequencies of a given value and all lesser values
  Since the Civil Disobedience question was dichotomous (two values), a cumulative distribution doesn’t apply in any way here.
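A quick way to see both calculations on the class table is a short Python sketch. The dictionary layout and the function names `percentage_distribution` and `cumulative_percentages` are my own choices for illustration, not anything from the course; the frequencies are the ones in the Civil Disobedience by Education table. It also shows why the cumulative percentage for a two-value variable has to end at 100.

```python
# Frequencies from the class table (Civil Disobedience by Education).
# The nesting (education level -> answer -> frequency) is an assumed layout.
table = {
    "< High School": {"Conscience": 4, "Obey Law": 6},
    "High School": {"Conscience": 13, "Obey Law": 11},
    "College": {"Conscience": 12, "Obey Law": 4},
}


def percentage_distribution(freqs):
    """Percent = (f / N) * 100 for each value, where N is the total of freqs."""
    n = sum(freqs.values())
    return {value: f / n * 100 for value, f in freqs.items()}


def cumulative_percentages(freqs):
    """Cumulative percent = (F / N) * 100, where F is the running total."""
    n = sum(freqs.values())
    running = 0
    result = {}
    for value, f in freqs.items():
        running += f
        result[value] = running / n * 100
    return result


college_pct = percentage_distribution(table["College"])
college_cum = cumulative_percentages(table["College"])

print(college_pct)  # share of college respondents giving each answer
print(college_cum)  # running total; the last value is always 100
```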
  After all, of what use is adding the given value and the sum of lesser values if there are only two frequencies? You would always get a result of 100 percent (12+4=16; 16/16=1; 1(100)=100), and that tells you nothing! So this time we’ll take a look at the Education totals. 24 have completed high school, and 10 have completed less than high school. So we add the frequencies (24+10=34) and divide that result by the total number of cases (50) to get 0.68; multiply that by 100 to get our cumulative percentage, 68%. In substantive terms, this would be expressed as “68% have completed only high school or less.”
- calculate the variance, standard deviation, standard error, z-score and confidence interval: Most of this is just plugging numbers into the variables and then doing the math.
  o Let’s look at Chapter 4 in the workbook, question # 12 first. The formula for the standard error ($\sigma_{\bar{X}}$) is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}}$, and since we’re given σ (the standard deviation) as being 3.00 and N (the number of cases) as being 300, we just need to put them in place and calculate:
    $\sigma_{\bar{X}} = \frac{3.00}{\sqrt{300}} = \frac{3.00}{17.321} = 0.173$
  o The variance formula (for sample data, $s^2$) is $s^2 = \frac{\sum (X_i - \bar{X})^2}{N - 1}$, and for a population it is simply over N instead of N - 1. We’re working with sample data here, so we go with the first one. For this let’s just make up a set of scores; they will be 3, 7, 11, 16, 22, 36 and 49. First we have to find the mean by adding up all the scores (= 144) and dividing that by the number of scores (7). So 20.571 is our sample mean ($\bar{X}$). Now we take each score, subtract that mean from it, and square each result. So for example 3 - 20.571 = -17.571, and -17.571 squared is 308.740.
    (3 - 20.571)² = (-17.571)² = 308.740
    (7 - 20.571)² = (-13.571)² = 184.172
    (11 - 20.571)² = (-9.571)² = 91.604
    (16 - 20.571)² = (-4.571)² = 20.894
    (22 - 20.571)² = (1.429)² = 2.042
    (36 - 20.571)² = (15.429)² = 238.054
    (49 - 20.571)² = (28.429)² = 808.208
    I hope I did all that right. We do this same operation for all the scores. After doing that, we add all of them up and divide the sum (1653.714) by N-1, which in this case is 7-1 or 6. And that will give us the variance ($s^2$) of 275.62.
  o Now to find the standard deviation, we need only to find the square root of the variance, which is 16.6. So the standard deviation (s for sample data in this case, or σ for a population) is 16.6.
  o So how about we convert one of those scores into a standard score, or z-score (expressed as $Z_i$)? This will tell us how many standard deviations the score is away from the mean; a standard format for comparing scores for just about anything, one would think. All that is required for this is to subtract the mean from the score and divide it by the standard deviation (which we know is 16.6):
    $Z_i = \frac{X_i - \bar{X}}{s}$
    So the formula, if we wanted the z-score of 49, would look like this:
    $Z_i = \frac{49 - 20.571}{16.6} = 1.713$
    therefore we say that 49 is about 1.7 standard deviations from the mean.
  o Finally, we’ll calculate the confidence interval: As everyone is no doubt painfully aware, a random sample can vary and the sample mean ($\bar{X}$) could be different from the population mean (μ). But the confidence interval will give us a range within which, if they are not the same, the population mean probably lies. Again we’ll use the info from Chapter 4, question 12 in the workbook.
    95 percent confidence interval: Since 95% of scores in a normal distribution lie within 1.96 standard deviations of the mean, we subtract 1.96 standard errors from the sample mean to find the lower limit and add 1.96 standard errors to find the upper limit. So there are two equations to do. That’s what the plus/minus sign means here:
    $y = \bar{X} \pm 1.96\,\sigma_{\bar{X}}$
    Notice that I have represented whatever number the 95% conf. interval will be as y, but that is just arbitrary; you could write “95% interval” for all it matters.
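Before plugging in the confidence-interval numbers, here is a short Python sketch that can be used to double-check the sum of squares, variance, standard deviation and z-score for the made-up scores 3, 7, 11, 16, 22, 36 and 49. It is only a check under my own assumptions: the variable names are mine, and it uses the unrounded mean, so individual squared deviations can differ from a hand calculation in the second decimal place.

```python
import statistics

scores = [3, 7, 11, 16, 22, 36, 49]

mean = statistics.mean(scores)

# Sum of squares: squared deviation of each score from the mean, added up.
squared_deviations = [(x - mean) ** 2 for x in scores]
sum_of_squares = sum(squared_deviations)

# Sample variance divides the sum of squares by N - 1;
# statistics.variance and statistics.stdev use that same N - 1 definition.
variance = sum_of_squares / (len(scores) - 1)
std_dev = statistics.stdev(scores)

# z-score: how many standard deviations the score 49 lies from the mean.
z_49 = (49 - mean) / std_dev

print(f"mean = {mean:.3f}")
print(f"sum of squares = {sum_of_squares:.3f}")
print(f"variance = {variance:.3f}")
print(f"standard deviation = {std_dev:.3f}")
print(f"z-score of 49 = {z_49:.3f}")
```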
    We now just plug numbers into these variables and work it out (the sample mean $\bar{X}$ is 25 and the standard error is 0.173):
    y = 25 - 1.96(0.173) = 24.661
    y = 25 + 1.96(0.173) = 25.339
    So (y =) 24.661 to 25.339 is our 95% confidence interval.
    99 percent confidence interval: This process is exactly the same, but we use the number 2.58 instead of 1.96 because we know that in a normal distribution, 99% of scores lie within 2.58 standard deviations of the mean:
    $y = \bar{X} \pm 2.58\,\sigma_{\bar{X}}$
    y = 25 - 2.58(0.173) = 24.554
    y = 25 + 2.58(0.173) = 25.446
    So our 99% confidence interval is (y =) 24.554 to 25.446.
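
The standard error and both confidence intervals can be checked with a small Python sketch. The function names `standard_error` and `confidence_interval` are hypothetical helpers of my own, and the inputs (σ = 3.00, N = 300, sample mean 25) are the values used in the worked example above. Because the code keeps the unrounded standard error rather than 0.173, the limits can differ from the hand calculation in the third decimal place.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of N."""
    return sigma / math.sqrt(n)


def confidence_interval(mean, se, z):
    """Return (lower, upper) limits: mean minus and plus z standard errors."""
    return (mean - z * se, mean + z * se)


se = standard_error(3.00, 300)
ci95 = confidence_interval(25, se, 1.96)  # 1.96 standard errors for 95%
ci99 = confidence_interval(25, se, 2.58)  # 2.58 standard errors for 99%

print(f"standard error = {se:.3f}")
print(f"95% C.I. = {ci95[0]:.3f} to {ci95[1]:.3f}")
print(f"99% C.I. = {ci99[0]:.3f} to {ci99[1]:.3f}")
```

Note that the 99% interval is wider than the 95% one: to be more confident that the range contains μ, the range has to cover more ground.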