PSC 211 Midterm Study Guide
SCROLL DOWN TO PAGE 2
(note: I post this edited syllabus with all due humility, noting that it is not perfect and reflects my
approaches to the problems we might have to do for the test, not necessarily yours.)
{Original Material}
You should be able to define and give the significance of the following. Where applicable, you
should know the symbol that represents the concept and be familiar with the formula for deriving
the statistic.
Closed-book section of the exam:
Inductive reasoning
Deductive reasoning
Subjective vs. objective reasoning
Descriptive statistics
Inferential statistics
Randomness
Data
Case
Unit of analysis
Ecological fallacy
Sample
Population
Nominal, ordinal, interval, and ratio level variables
Discrete versus continuous variables
Frequency distributions
Percentage distributions
Cumulative distributions
Unimodal distribution
Sum of squares
Variance
Standard deviation
Bimodal distribution
Mean
Median
Mode
Skewness
N
Normal distribution
z-scores
sampling distribution
distribution of a sample
population distribution
standard error
standardized variables
95% confidence interval
99% confidence interval
Positive relationship
Negative relationship
Curvilinear relationship
Central limit theorem
Proportion
Open Book Section of Exam
There will be an open book, open notes section of the exam. In this portion, you will need to:
- Calculate the mean, median, and mode.
- Calculate frequency distributions, percentage distributions and cumulative percentage distributions.
- Calculate the variance, standard deviation, z-scores, standard error, and confidence intervals for data sets.
- Draw pie charts and bar graphs to describe data.
NOTE: Be certain that you can provide a substantive interpretation of the statistics. In other words, how would you explain the significance of a statistic to your grandmother? The ability to provide a meaningful translation is essential to understanding and conducting political science research. {End original material}
FOR THE EXAM:
Bring a calculator, pencils, and erasers. I will supply all paper and a stapler.
Closed-book section:
- inductive reasoning: reasoning from detailed facts to general principles
  o for example, using a set of observations to make generalizations about data, like "people with pets live longer." It cannot be proved absolutely that owning a pet makes you live longer.
    ▪ Statistical inference is a type of inductive reasoning; i.e. assuming that something is true of a population because it is true of a representative sample. Accurate as long as the sample size is large enough.
- deductive reasoning: reasoning from the general to the particular
  o for example, the classic syllogism:
    ▪ All men are mortal
    ▪ Socrates is a man
    ▪ Therefore, Socrates is mortal
- subjective v. objective reasoning: subjective reasoning rests on personal feelings, opinions, or intuition; objective reasoning rests on observable, verifiable evidence that others can check
- descriptive statistics: methods for summarizing information so that it is more intelligible, more useful or can be communicated more effectively
  o e.g. calculating averages, graphing techniques (baseball stats)
  o helps us understand what the data actually mean
- inferential statistics: procedures used to generalize from a sample to the larger population and assess the confidence we have in such generalizing
  o for example, opinion polls using representative samples, with a margin of error
  o 95/99% confidence intervals
  o relevant b/c it helps us understand things about large groups that we wouldn't be able to completely measure
- randomness: we may see patterns in the world and society that aren't there, but also not notice patterns when they exist. Statistics can help see through the seeming randomness of phenomena
- data: the unsummarized records of observations that statistics makes more manageable
  o ex: what happened at bat every time
- unit of analysis: the person, object or event that a researcher is studying
  o ex: individuals, groups, editorials, elections
- case: the specific unit from which data are collected
  o ex: the person being interviewed, college students
- ecological fallacy: the logical error of inferring characteristics of individuals from aggregate data
  o ex: since people who have dogs tend to live longer, and I own a dog, I will live longer
- aggregate data: data in which the cases are larger units of analysis
- sample: a part of the population that, when chosen randomly, can with degrees of confidence be generalized to the population
  o statistic: a characteristic of a sample
- population: all or almost all cases to which a researcher wants to generalize
  o ex: research on Kentucky: population = all the people living in KY
  o oftentimes too large, expensive, time-consuming or rapidly-changing to collect all data from the population
  o parameter: a characteristic of a population
- averages:
  o mode: (or Mo) the most frequently occurring score on a variable (for example: female is the modal gender in the US)
    ▪ unimodal distribution: distribution in which one score occurs considerably more often than other scores (i.e. it has only one mode); there will be only one "hump"
    ▪ bimodal distribution: bar graph or histogram shows two scores that are obviously the most common (it has 2 modes); may resemble a camel's humps
  o median: (or Md) the value that divides an ordered set of scores in half
    ▪ calculating the median for:
      • an odd number of scores: put the scores in order from lowest to highest, then find the middle score
      • an even number of scores: put scores in order from lowest to highest, find the two middle scores, average these two scores by adding them and dividing by 2
  o mean: (μ or X̄) the arithmetical average found by dividing the sum of all scores by the number of scores (or N): X̄ = ΣXᵢ / N
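The mean formula can be sketched in a couple of lines of Python; the scores below are made up for illustration:

```python
# Mean: sum all scores, then divide by the number of scores (N).
scores = [4, 8, 15, 16, 23]        # hypothetical data
mean = sum(scores) / len(scores)   # X-bar = (sum of Xi) / N
print(mean)                        # 13.2
```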
- Levels of measurement:
1. nominal variables: measured such that their attributes are different, but not ordered along some underlying continuum (like high to low)
a. ex: male and female; red, white, and blue
2. dichotomous variable: has exactly two values
a. ex: yes and no, male and female
b. nominal and ordinal variables can be dichotomous
3. ordinal variable: one whose values can be rank-ordered, but nothing else
a. ex: none, a little, a lot, always / social class
4. interval variable: has values that can be rank-ordered, using a standard unit of
measurement (ex: dollar, pounds, inches)
5. ratio variable: like interval variable, but has a non-arbitrary zero point
representing the absence of the characteristic being measured (ex: number of
years in school, # of hours spent watching TV, temperature in degrees Kelvin)
6. interval-ratio variables: since there aren't many interval variables in the social sciences and they can usually be handled the same way, the two are grouped together here
- continuous variable: can take on any value in a range of possible values (ex: age measured to the second, attitudinal variables)
- discrete variable: can have only certain values within its range (ex: family size = 1, 2, …)
  o nominal is always discrete, but interval-ratio can be discrete or continuous
- normal distribution: typically looks like a symmetric bell-shaped curve; a given standard deviation from the mean will always "cut off" a certain percentage of scores: about 68%, 95%, and 99.7% of scores lie within 1, 2, and 3 standard deviations of the mean, respectively
- frequency distribution: summarizing data by counting the number of cases with each score
- percentage distribution: standardizes summaries to some degree; makes them easier to understand, especially with large numbers of observations
- cumulative distributions: tell us things like what % of respondents are greater than or less than x; useful only for ordinal or interval-ratio variables
  o cumulative percentage: the percentage of all scores that have a given value or less; calculated as (F/N)(100)
  o cumulative frequency: the sum of all frequencies of a given or lesser value
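The three kinds of distributions can be built with a few lines of Python; the scores below are hypothetical ordinal data invented for illustration:

```python
from collections import Counter

# Hypothetical ordinal scores (say, years of schooling completed).
scores = [8, 12, 12, 12, 16, 16, 12, 8, 16, 12]
n = len(scores)

freq = Counter(scores)                                   # frequency distribution
pct = {v: f / n * 100 for v, f in sorted(freq.items())}  # percentage distribution

# Cumulative percentage: % of scores at a given value or less.
cum_pct, running = {}, 0
for value in sorted(freq):
    running += freq[value]
    cum_pct[value] = running / n * 100

print(freq[12], pct[12], cum_pct[12])  # 5 50.0 70.0
```

Note that the cumulative distribution only makes sense because the values can be put in order, which is why it applies only to ordinal or interval-ratio variables.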
- skewness: the extent to which an (asymmetric) distribution of a variable has more cases in one direction than the other (negatively skewed or positively skewed)
- sum of squares: the sum of squared deviations from the mean; calculated as Σ(Xᵢ − X̄)²; or in other words, subtract the mean from each score, square each difference and add all these together
- N: the number of scores. For example, when calculating the mean, you divide the sum of all scores by the number of scores
- measures of variation: summarize how close together or spread out scores are
  o variance: (represented as s²) the average squared deviation from the mean; calculated as s² = Σ(Xᵢ − X̄)² / (N − 1) for a sample (for a population, divide by N instead of N − 1), where N is the number of cases, X̄ is the mean and Xᵢ is a score
  o standard deviation: (represented as s for a sample, σ for a population) roughly the average deviation from the mean; calculated as the square root of the variance: s = √s²
  o z-score (standard score): the number of standard deviations that a score is from the mean; gives us a standard measure of variation that can be used to compare scores from distributions with different means and standard deviations: Zᵢ = (Xᵢ − X̄) / s
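These three measures chain together naturally, which a short Python sketch makes concrete (the data are made up for illustration):

```python
import math

# Sample variance, standard deviation, and z-scores from the formulas above.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
n = len(scores)
mean = sum(scores) / n                       # X-bar = 5.0

ss = sum((x - mean) ** 2 for x in scores)    # sum of squares
variance = ss / (n - 1)                      # s^2 (divide by N for a population)
s = math.sqrt(variance)                      # standard deviation

z = [(x - mean) / s for x in scores]         # z-score for each score
print(round(variance, 3), round(s, 3), round(z[-1], 3))
```

The z-scores at the end are what let you compare a score from this distribution with a score from a completely different one.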
- population distribution: the distribution of scores in a population
- distribution of a sample: the distribution of scores in a sample of a given size
- sampling distribution: the distribution of some statistic (e.g. the mean) in all possible samples of a given size; we can't actually draw all possible samples of a given size from large populations, but statisticians have figured out what the sampling distribution is for certain important statistics. (see next)
- central limit theorem: in a random sample, as the sample size N increases, the sampling distribution of the mean more and more closely resembles a normal distribution with a mean equal to the population mean and a standard deviation of σ/√N. We can find the proportion of "cases" (i.e. sample means) that lie a given number of standard deviations from the mean of the sampling distribution
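A quick simulation can illustrate the theorem. Everything below (the population, the sample size, the number of samples) is invented for the sketch; the point is that the spread of the sample means comes out close to σ/√N:

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# A made-up population of 100,000 scores.
population = [random.uniform(0, 100) for _ in range(100_000)]
sigma = statistics.pstdev(population)      # population standard deviation

# Draw many random samples of size N and record each sample's mean.
N = 30
sample_means = [statistics.mean(random.sample(population, N))
                for _ in range(2000)]

predicted = sigma / math.sqrt(N)           # what the CLT says the spread should be
observed = statistics.stdev(sample_means)  # what the simulation actually shows
print(round(predicted, 2), round(observed, 2))  # the two should be close
```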
- standard error: (expressed as σx̄) the standard deviation of a sampling distribution: σx̄ = σ/√N
- confidence intervals: our best estimate of μ (the population mean) is X̄ (the sample mean), but random samples can vary. (The population mean can be more or less than the sample mean.) Confidence intervals are a range around X̄ in which μ probably lies (i.e. we are confident that if μ isn't X̄, then it's at least within this range).
  o 95 percent confidence interval: based on 95% of scores in a normal distribution lying within 1.96 standard deviations of the mean. 95% C.I. = X̄ ± 1.96σx̄
    ▪ in other words, add 1.96 times the standard error to the mean to find the upper limit, then subtract to find the lower limit. The interval will run from some number x through some number y
  o 99 percent confidence interval: based on 99% of scores in a normal distribution lying within 2.58 standard deviations of the mean. 99% C.I. = X̄ ± 2.58σx̄
    ▪ same as the last one, but substitute 2.58 in there
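Both intervals are the same computation with a different multiplier, as a short sketch shows; the sample mean, σ, and N below are hypothetical:

```python
import math

# 95% and 99% confidence intervals for a mean (all numbers invented).
sample_mean = 50.0
sigma = 10.0    # population standard deviation, assumed known here
n = 400

se = sigma / math.sqrt(n)   # standard error = 0.5
ci95 = (sample_mean - 1.96 * se, sample_mean + 1.96 * se)
ci99 = (sample_mean - 2.58 * se, sample_mean + 2.58 * se)

print(round(ci95[0], 2), round(ci95[1], 2))  # 49.02 50.98
print(round(ci99[0], 2), round(ci99[1], 2))  # 48.71 51.29
```

Note that the 99% interval is wider: to be more confident that μ is inside the range, the range has to cover more ground.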
- proportion: the number you get before you multiply by 100 to get a percentage; so on p.32-33 for example, the proportion of cases that don't recycle much is 0.42, and 42 percent do not recycle much. It's just a different way of expressing it.
- positive relationship: when doing a bivariate analysis, a positive relationship is when higher scores on one variable are associated with higher scores on the other variable (ex: "as x increases, y increases")
- negative relationship: higher scores on one variable are associated with lower scores on the other variable (ex: "as x increases, y decreases")
- curvilinear relationship: a relationship that starts positive and turns negative, or starts negative and turns positive. For example, in class we mentioned the effect of African American population on welfare benefits. When it is low, benefits are high. Increasing this number decreases the benefits, up to a point at which blacks become a significant enough portion of the population to direct policy, and benefits start to rise again.
Open-book section: (note: I wouldn't be surprised if I made an error with these numbers somewhere. Brani pointed out one, which I corrected. As long as you don't screw up the arithmetic, the procedures should be sound though.)
- calculate the mean, median and mode: let's say we have 6 scores: 12, 7, 18, 26, 44, and 107
  o mean (X̄ or μ): add all the scores together (12+7+18+26+44+107=214) to get the sum of all scores (ΣXᵢ) and divide that by the total number of scores (N, or 6 here): X̄ = ΣXᵢ / N (population mean = μ). So the mean is 214/6 ≈ 35.67
  o median: (or Md) we put these scores in order (7, 12, 18, 26, 44, 107) and would normally just pick the score in the middle, but there is an even number. So we take the middle two scores, add them together and divide by 2 (i.e. find their mean). 18+26=44; 44/2=22 - so 22 is the median
    ▪ a good example of this is on the capstone website (exploringhuntington.com); the median household income is about $23,000, which means half the households in Huntington earn less than that.
  o mode: (or Mo) there is no mode here, because each score occurs only once. However, if there were two 12's or maybe two 7's, then the number that occurred twice (by definition more than any other) would be the mode.
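The same answers come out of Python's statistics module, here using the six values from the median step above (multimode returns every value tied for most frequent, which is all of them here since no score repeats, i.e. there is no real mode):

```python
import statistics

scores = [12, 7, 18, 26, 44, 107]     # the six scores from the example

mean = statistics.mean(scores)        # 214 / 6
median = statistics.median(scores)    # average of the middle two, 18 and 26
modes = statistics.multimode(scores)  # every score ties, so no meaningful mode

print(round(mean, 2), median, len(modes))  # 35.67 22.0 6
```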
- calculate a frequency distribution: this is just adding up how many cases have each score. For example, looking at the table below, with f representing the frequency with which each score occurs, we can deduce that 4 people answered "graduate," 9 answered B.A., etc. when asked about the highest education level they had received.
We can also use this data to create percentage distributions and cumulative percentage distributions (see below)
The frequency distribution table we did in class on 2/21/05:

Civil Disobedience by Education (in frequencies)
Title = [dependent variable] by [independent variable] (in [freq. or percent.])
Independent variable columns run ascending (low → high); dependent variable rows run in descending order

             < High School   High School   College   Total
Conscience         4              13          12        29
Obey Law           6              11           4        21
Total             10              24          16        50
- calculate a percentage distribution: you're standardizing the summary distribution by calculating what each frequency would be if there were a total of exactly 100 cases ("percentaging"). Divide each frequency by the total number of cases, then multiply each result by 100. You will now have a set of percentages to put into a table.
Percent = (f/N)(100)
Where f is the number (frequency) of scores that have a given value, and N is the total number of cases
Let's apply this to the table we worked on in class, specifically what those who had completed college answered to the question. 12 (or f) divided by 16 total college cases (or N) is 0.75; multiplied by 100 that is 75. So we can say that ¾ (or 75%) of college graduates answered "Conscience," suggesting that they would be willing to disobey a law they thought was unjust. Since there is only one other value we don't have to do any more math on this question; ¼ of college respondents here think you should obey the law no matter what. Note that you can circle the highest scores in the < High School, High School and College categories to see what most answered in each one, and with our table this demonstrates a positive relationship between education and civil disobedience.
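The percentaging step above, applied to the College column of the class table, is just a divide-and-multiply:

```python
# Percentaging the College column of the class table: 12 answered
# "Conscience" and 4 answered "Obey Law", out of 16 college cases.
college = {"Conscience": 12, "Obey Law": 4}
n = sum(college.values())   # 16

percents = {answer: f / n * 100 for answer, f in college.items()}
print(percents)  # {'Conscience': 75.0, 'Obey Law': 25.0}
```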
- calculate a cumulative [percentage] distribution: a cumulative percentage is the % of all scores with a given value or less, so first you add the frequency for the given value and the frequencies for all lesser values (this sum is F). Then divide F by the total number of cases (N) and multiply the result by 100.
Percent = (F/N)(100)
Where F is the sum of the frequencies of a given value and all lesser values
Since the Civil Disobedience question was dichotomous (two values), a cumulative distribution doesn't apply in any way here. After all, of what use is adding the given value and the sum of lesser values if there are only two frequencies? You would always get a result of 100 percent (12+4=16; 16/16=1; 1(100)=100), and that doesn't tell you anything!
So this time we'll take a look at the Education totals. 24 have completed high school, and 10 have completed less than high school. So we add the frequencies (24+10=34) and divide that result by the total number of cases (50) to get 0.68; multiply that by 100 to get our cumulative percentage, 68%. In substantive terms, this would be expressed as "68% have completed high school or less."
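The same running-total logic, applied to the education column totals from the class table:

```python
# Cumulative percentages for the education totals
# (< High School = 10, High School = 24, College = 16; N = 50).
freqs = [("< High School", 10), ("High School", 24), ("College", 16)]
n = sum(f for _, f in freqs)   # 50

cum = []
running = 0
for level, f in freqs:
    running += f               # F: this value plus all lesser values
    cum.append((level, running / n * 100))

print(cum)  # the High School entry is the 68% worked out above
```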
- calculate the variance, standard deviation, standard error, z-score and confidence interval: most of this is just plugging numbers into the variables and then doing the math. Let's look at Chapter 4 in the workbook, question #12 first. The formula for the standard error (σx̄) is σx̄ = σ/√N, and since we're given σ (the standard deviation) as 3.00 and N (the number of cases) as 300, we just need to put them in place and calculate: σx̄ = 3.00/√300 = 3.00/17.321 ≈ 0.173
The variance formula for sample data is s² = Σ(Xᵢ − X̄)² / (N − 1); for a population it is the same, except you divide by N instead of N − 1. We're working with sample data here, not generalizing to the population, so we go with the first one. For
this let's just make up a set of scores: 3, 7, 11, 16, 22, 36 and 49. First we have to find the mean by adding up all the scores (= 144) and dividing that by the number of scores (7). So 20.571 is our sample mean (X̄). Now we take each score, subtract the mean from it, and square each result. So for example 3 – 20.571 = -17.571, and -17.571 squared is 308.74.
(3 – 20.571)² = (-17.571)² = 308.740
(7 – 20.571)² = (-13.571)² = 184.172
(11 – 20.571)² = (-9.571)² = 91.604
(16 – 20.571)² = (-4.571)² = 20.894
(22 – 20.571)² = (1.429)² = 2.042
(36 – 20.571)² = (15.429)² = 238.054
(49 – 20.571)² = (28.429)² = 808.208
Damn, I hope I did all that right. We do this same operation for all the scores. After doing that, we add all of them up and divide the sum (1653.714) by N − 1, which in this case is 7 − 1 or 6. And that gives us the variance (s²) of 275.62.
Now to find the standard deviation, we need only take the square root of the variance, which is 16.6. So the standard deviation (s for sample data in this case, or σ for a population) is 16.6.
o So how about we convert one of those scores into a standard score, or z-score (expressed as Zᵢ)? This will tell us how many standard deviations the score is away from the mean; a standard format for comparing scores for just about anything, one would think.
All that is required for this is to subtract the mean from the score and divide by the standard deviation (which we know is 16.6): Zᵢ = (Xᵢ − X̄) / s
So the formula, if we wanted the z-score of 49, would look like this:
Zᵢ = (49 − 20.571) / 16.6 ≈ 1.713; therefore we say that 49 is about 1.7 standard deviations from the mean.
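You can double-check this whole stretch of arithmetic with Python's statistics module, which builds in the N − 1 divisor for sample data:

```python
import statistics

# Re-checking the worked numbers above.
scores = [3, 7, 11, 16, 22, 36, 49]

mean = statistics.mean(scores)    # 144 / 7, about 20.571
s2 = statistics.variance(scores)  # sample variance (divides by N - 1)
s = statistics.stdev(scores)      # sample standard deviation

z_49 = (49 - mean) / s            # z-score of the score 49
print(round(s2, 2), round(s, 1), round(z_49, 2))  # 275.62 16.6 1.71
```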
Finally, we'll calculate the confidence interval:
As everyone is no doubt painfully aware, a random sample can vary and the sample mean (X̄) could be different from the population mean (μ). But the confidence interval will give us a range within which, if they are not the same, the population mean lies. Again we'll use the info from Chapter 4, question 12 in the workbook:
95 percent confidence interval: since 95% of scores in a normal distribution lie within 1.96 standard deviations of the mean, we subtract 1.96 standard errors from the sample mean to find the lower limit and add 1.96 standard errors to find the upper limit. So there are two equations to do. That's what the plus/minus sign means here: y = X̄ ± 1.96σx̄
Notice that I have represented whatever number the 95% conf. interval will be as y, but that is just arbitrary; you could write "95% interval" for all it matters. We now just plug numbers into these variables and work it out:
y = 25 − 1.96(0.173)
y = 25 + 1.96(0.173)
That comes to 24.661 (by subtracting) and 25.339 (by adding). So (y =) 24.661–25.339 is our 95% confidence interval.
This process is exactly the same when calculating the 99 percent confidence interval, but we use the number 2.58 instead of 1.96 because we know that in a normal distribution, 99% of scores lie within 2.58 standard deviations of the mean:
y = X̄ ± 2.58σx̄
y = 25 − 2.58(0.173) = 24.554
y = 25 + 2.58(0.173) = 25.446
So our 99% confidence interval is (y =) 24.554–25.446.
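As a final check, the same workbook numbers can be run through Python end to end (rounding the standard error to 0.173 first, as done above, so the answers match digit for digit):

```python
import math

# Re-checking the confidence intervals: sigma = 3.00, N = 300,
# and a sample mean of 25 (the workbook numbers used in this section).
sigma, n, sample_mean = 3.00, 300, 25

se = round(sigma / math.sqrt(n), 3)    # standard error, rounded to 0.173 as above

lo95, hi95 = sample_mean - 1.96 * se, sample_mean + 1.96 * se
lo99, hi99 = sample_mean - 2.58 * se, sample_mean + 2.58 * se

print(round(lo95, 3), round(hi95, 3))  # 24.661 25.339
print(round(lo99, 3), round(hi99, 3))  # 24.554 25.446
```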