ECON 309
Lecture 1: The Basics
I. Descriptive versus Inferential Statistics
Descriptive: statistics that summarize the characteristics of given data, without trying to
extrapolate or make predictions.
Inferential: statistics used to make claims or predictions about the larger population
based on a subset (sample) of that population.
The distinction between descriptive and inferential statistics is closely related to the
distinction between populations and samples:
population
sample
Population: the entire set of all possible outcomes or measurements of interest.
Sample: a subset of the population for which we have data, and that we hope is
representative of the population.
Descriptive statistics usually deal with a sample, but sometimes they deal with a whole
population. For instance, we could calculate descriptive statistics using Census data,
which purports to document the entire population of the U.S. (In reality, even the Census
doesn’t document everyone, so it’s really a sample that is very close to being the whole
population.)
Inferential statistics apply when we have a sample and we’re using it to make claims or
predictions (inferences) about the whole population.
In a future lecture, we’ll discuss some problems that arise from trying to get a sample that
is sufficiently representative of the population.
A note on the meaning of “population”: it’s easy to be misled by the name into thinking
a population must be a group of people, or perhaps other organisms. The population can
be all kinds of things: all baseballs manufactured in the U.S. (or by a particular
company), all restaurants in a metropolitan area, all volcanoes in the world, etc.
Moreover, the population may not even in principle be something that we could count or
collect. For instance, suppose we want to know something about the likelihood of a
baseball-making machine producing defective baseballs. The population is not just all
baseballs the machine has ever produced, but every baseball the machine will or ever
could produce. Even if we take all baseballs the machine has ever produced, they are just
a sample of all the baseballs the machine could produce. Similarly, we might want to
know something about the properties of a pair of dice. The population is all the throws of
the dice that could possibly be made, which is infinite (you can keep on throwing the
dice). Any number of actual dice throws is just a sample. So in the real world, we are
usually dealing with samples.
II. Types of Data
Primary versus secondary. Data is primary if it has been collected by the same person or
entity that is using it. Data is secondary if the user is not the person or entity who
collected it.
The book says information from the Census, Bureau of Labor Statistics, Dept. of
Commerce, etc., is secondary. Well, that’s true if you use it. If they (that is, employees
of the Census Bureau) use it, it’s primary.
Why is this important? Because you have control over how your primary data is
collected. You decide the methodology (how the data will be collected). You decide
how many observations to collect, for example. With secondary data, you are stuck with
the methods used by whoever collected it.
What are the ways in which primary data is collected?
1. Direct observation – that is, just watching and measuring. A naturalist watches
squirrels to see how many nuts they collect. A business collects information on
who is buying their product and how much. (The book mentions focus groups as
an example of direct observation, but it really depends on how the focus group is
conducted. If the group is just being watched and recorded, that’s direct
observation. If the group is being asked specific questions, that’s more like a
survey – see below.) In economics, we collect measurements of many economic
variables of interest, like prices and quantities sold, by direct observation.
2. Experiments – subjects are divided into treatment groups and control groups to
measure the difference between them after some kind of treatment is given to the
former group. This is very common in medical testing. It’s much harder to do in
economics and business, but it does happen! Experimental economists get
students in lab settings and expose them to different incentives (such as playing
games for money) to see what they’ll do. Economists also try to find “natural
experiments,” which are on the borderline between direct observation and
experiment. These occur when some kind of policy change or other event
automatically affects one group of subjects but not another (e.g., when New
Jersey raises its minimum wage but Pennsylvania does not). The difficulty is
making sure the treatment group and control group are sufficiently similar in all
other respects.
3. Surveys – asking people questions. More about these in a future lecture. The main
thing to remember about surveys: the information you’re getting is the truth
about how people respond to questions, but not necessarily the truth about the
actual content of the question. If you ask smokers if they want to quit, and 67%
say yes, that doesn’t mean 67% of smokers really want to quit – it means 67% say
they want to quit when asked by a surveyor.
We can also divide data based on the sort of information it contains, and how easily
numbers can be attached to it.
1. Nominal data. This is when the data fits a qualitative category. We can say
whether someone’s a man or a woman, a registered Republican or Democrat, or
something else. Any numbers assigned are arbitrary (e.g., we could code “1” for
man and “0” for woman, but the reverse would have made just as much sense).
2. Ordinal data. This is when numbers are used to represent an ordering or ranking
of items, but the differences between numbers don’t mean anything. E.g., if asked
to rank McDonald’s, BK, and Wendy’s, a consumer might say: 1. BK, 2.
Wendy’s, 3. McDonald’s. That means this consumer thinks BK is better than
both Wendy’s and McDonald’s. But it doesn’t mean, say, that BK is three times
better than McDonald’s, or the difference in quality between BK and Wendy’s is
the same as the difference in quality between Wendy’s and McDonald’s.
3. Interval data. This is numerical data in which differences have meaning, but
ratios don’t. Consider temperature in Fahrenheit. The difference between 40
degrees and 30 degrees is 10, and that’s the same as the difference between 50
degrees and 40 degrees. But is 40 degrees twice as hot as 20? No, that doesn’t
follow. For that to make sense, there must be a true “zero” point. We do
designate one point as zero degrees, but we could just as easily pick another point
(as indicated by the difference between Fahrenheit and Celsius).
4. Ratio data. This is numerical data in which both differences and ratios have
meaning. Prices are a nice example. $10 minus $6 is $4, and $20 minus $16 is
$4; those $4’s are the same. And $20 is twice as much as $10; there is a true zero
point, a price of zero.
In economics, we will also often categorize data by how it relates to time.
1. Cross-sectional data. In cross-sectional data, all observations come from the same
point in time. The observations typically correspond to individuals or groups like
states or countries. For instance, a survey of Americans on who they support in
the upcoming presidential election is cross-sectional data. So is a data set with
the homicide rate for each state in a single year.
2. Longitudinal or time-series data. In longitudinal or time-series data, each data
point corresponds to a particular point in time – usually for a single individual or
group. For instance, if you recorded your income every day for a year, that would
give me a longitudinal data set. The GDP of the U.S. from 1945 to the present is
also a longitudinal data set.
3. Panel data. Panel data is both cross-sectional and longitudinal. It involves getting
cross-sectional data for many time periods (or, alternatively, time-series data for
many different individuals or groups). For instance, if you recorded the income
for each one of your classmates every year for the next 20 years, that would be a
panel data set.
One way to think of this is in terms of dimensions. Both cross-sectional and time-series
data are one-dimensional; panel data is two-dimensional.
III. Measures of Central Tendency
A measure of central tendency is meant to tell us the “center” of a data set or population.
What do we mean by “center”? That’s an inherently vague question. We might mean a
typical value, or the most common value, or a value that’s in the middle… We need to be
more specific.
Mean
The mean is what we usually think of as the average (although “average” can be used to
refer to other measures of central tendency as well). For a sample, the mean is the sum of
all observations divided by the number of observations. Here is the formula for the
sample mean:
x̄ = (Σ_{i=1}^{n} x_i) / n
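As a quick sketch in Python (the data values here are invented just for illustration), the sample mean is the sum of the observations divided by their count, which is also what the standard library's statistics.mean computes:

```python
import statistics

# Hypothetical sample of five observations
sample = [4, 8, 6, 5, 7]

# Sample mean: sum of observations divided by the number of observations
mean_by_hand = sum(sample) / len(sample)

# The statistics module computes the same quantity
assert mean_by_hand == statistics.mean(sample)
print(mean_by_hand)  # 6.0
```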
For the mean of a population, in principle we do the same thing: add up the values of all
possible observations, and divide by the number of possible observations. But what if
there is no maximum number of observations? Consider rolling a standard 6-sided die.
We could roll it an infinite number of times, so any finite number of rolls is only a
sample. What is the population mean? You have to take each possible value for an
outcome (in this case, one through six), multiply by its frequency as a fraction of all
outcomes, and add up the results. Here is the formula for the population mean:
μ = Σ_{x∈X} x f(x)
(The expression f(x) means the frequency of the value x in the population; it is a number
between 0 and 1.) We use the Greek letter mu (μ) to stand for the population mean,
which we sometimes call the “true” mean. For the throw of a 6-sided die, the population
mean is 1*(1/6) + 2*(1/6) + 3*(1/6) + 4*(1/6) + 5*(1/6) + 6*(1/6) = 3.5.
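The die calculation can be sketched in Python; using exact fractions for the frequencies avoids any rounding:

```python
from fractions import Fraction

# Population mean of a fair six-sided die: each outcome x is weighted
# by its frequency f(x) = 1/6 in the population of all possible throws
outcomes = [1, 2, 3, 4, 5, 6]
mu = sum(x * Fraction(1, 6) for x in outcomes)
print(float(mu))  # 3.5
```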
In most cases, we don’t actually know the population mean, so we try to estimate it with
the sample mean.
The technique we just used to find the population mean is also useful when you’re given
sample data in the form of a frequency table, with each value that occurred alongside the
frequency with which it occurred. E.g.,
Answers to “How Many Times Did You Use the Restroom Today?”
Answer   # Subjects
  0          1
  1          3
  2          4
  3          8
  4          5
  5          3
We take each possible answer and multiply by its frequency in the sample. The sample
mean here is [0(1) + 1(3) + 2(4) + 3(8) + 4(5) + 5(3)]/24 = 2.92. (Notice that this is the
same as finding the fractional frequency of each value as # Subjects/24, multiplying by
the answer value, and then adding them up.)
Median
The median is the value of the observation exactly in the middle of the sample or
population, such that half the observations have a higher value and half have a lower
value. (If there is an even number of observations, then there is no true middle value, so
the median is defined as the mean of the two middle values.)
In the restroom example, the median is 3 (because there are eight observations higher
than 3 and eight observations lower, and the middle eight observations are all the same).
What would happen if we added three more people who went to the restroom 6 times
each? The total observations would be 27, so we’d be looking for the 14th highest (and
14th lowest) observation. It’s still 3!
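We can check both medians in Python by expanding the frequency table into a raw list of answers (note that with an even number of observations, statistics.median averages the two middle values, so it returns a float):

```python
import statistics

# Expand the restroom frequency table into a raw list of 24 answers
table = {0: 1, 1: 3, 2: 4, 3: 8, 4: 5, 5: 3}
data = [answer for answer, count in table.items() for _ in range(count)]

print(statistics.median(data))  # 3.0 (average of the two middle 3s)

# Adding three more people who answered 6 leaves the median unchanged:
# with 27 observations, the 14th-ranked value is still 3
print(statistics.median(data + [6, 6, 6]))  # 3
```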
Mode
The mode is the most common outcome in the sample or population. For example, the
modal sex in the United States is female (because there are more women than men). The
modal race in the United States is white (because there are more whites than there are of
any one other race). Note that this latter example will still be correct when whites are no
longer the majority race, because that will just mean that people of all the other races
together outnumber whites. Mode means the single most common outcome, not
necessarily the majority outcome.
(We sometimes use the word plurality to mean the most common outcome, which may or
may not be the majority outcome. For instance, Clinton won the presidency in 1992 with
43% of the popular vote; this was the plurality, greater than Bush’s 37.5% and Perot’s
18.9%, but it was not the majority. The mode and the plurality are essentially the same,
although mode is the preferred term in statistics.)
The examples involve nominal data, and that’s where mode is most often useful. But it
can be used with numbers, too. In the restroom frequency table above, the mode is 3. In
this example, the possible outcomes are numerical, but they are also discrete, meaning
they take on a countable number of different values. You can’t use the bathroom one-half a time!
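As a sketch, Python's statistics.mode returns the single most common value, and it works for nominal data as well as numbers (the string example below is invented for illustration):

```python
import statistics

# Restroom answers expanded from the frequency table above
data = [0] + [1] * 3 + [2] * 4 + [3] * 8 + [4] * 5 + [5] * 3

# Mode: the single most common outcome
print(statistics.mode(data))  # 3

# Works for nominal (categorical) data too
print(statistics.mode(["white", "black", "white", "asian"]))  # white
```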
The mode makes less sense with characteristics that are not discrete, but are instead
continuous, meaning the variable can take on an uncountable number of different values.
Consider height. If you measure height precisely enough, it’s difficult to find anyone
who is exactly any height you specify in advance – e.g., 6’0”. Everyone you find will be
just slightly above or slightly below it. The frequency of any precisely defined height is
approximately zero! So to define the mode in cases like this, you need to establish
intervals (such as inches of height, which actually includes an interval of heights rounded
off to the nearest whole inch).
If you have to create intervals to find the mode, the mode obviously depends on how
you’ve defined your intervals. For instance, we could choose half-inch intervals instead
of whole-inch intervals for measuring height; our answer for the mode will be a half-inch
interval instead of a whole-inch interval.
If your intervals are not of equal size, then the mode can be difficult to find, and in some
cases is actually meaningless. We’ll see an example of this in a future lecture.
Much confusion results from people not truly understanding these three different
measures of central tendency and how they can differ. We will return to this in a future
lecture.
IV. Variance and Standard Deviation
Variance: a measure of dispersion in which observations are weighted by the square of
their distance from the mean, as given by the following formula:
s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)
Note that this gives greater weight to an observation the further it is from the mean. For
example, suppose the mean is zero. An observation of 4 (or -4) would be weighted four
times as much as an observation of 2 (or -2), despite being only twice as far from the
mean.
There is an equivalent formula for the sample variance that is often easier to use:
s² = [ Σ_{i=1}^{n} x_i² − (Σ_{i=1}^{n} x_i)² / n ] / (n − 1)
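A sketch in Python, using an invented sample, showing that the definitional formula and the shortcut formula give the same answer, and that both match the stdlib's statistics.variance (which also divides by n − 1):

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
n = len(sample)
xbar = sum(sample) / n

# Definitional formula: sum of squared deviations over (n - 1)
s2_def = sum((x - xbar) ** 2 for x in sample) / (n - 1)

# Shortcut formula: (sum of squares - (sum)^2 / n) over (n - 1)
s2_short = (sum(x * x for x in sample) - sum(sample) ** 2 / n) / (n - 1)

assert abs(s2_def - s2_short) < 1e-12
assert abs(s2_def - statistics.variance(sample)) < 1e-12
print(s2_def)  # 32/7 ≈ 4.571
```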
Recall that the mean could be calculated for both the sample and the population. The
same is true of the variance. The formula for the population variance is:
σ² = Σ_{x∈X} (x − μ)² f(x)
All you’re doing here is taking each possible value of the x, finding its squared difference
from the mean (just as with the sample variance), and weighting it by its frequency in the
population. The population variance for throws of a six-sided die is:
(1/6)[(1 – 3.5)² + (2 – 3.5)² + (3 – 3.5)² + (4 – 3.5)² + (5 – 3.5)² + (6 – 3.5)²]
= (1/6)[6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25] ≈ 2.92
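The same calculation in Python, with exact fractions (35/12 is the exact value behind the 2.92 above):

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
f = Fraction(1, 6)  # frequency of each outcome for a fair die

mu = sum(x * f for x in outcomes)                  # 7/2 = 3.5
sigma2 = sum((x - mu) ** 2 * f for x in outcomes)  # population variance
print(sigma2, float(sigma2))  # 35/12 ≈ 2.9167
```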
You might wonder why, for the sample variance, we divided by (n – 1) instead of just n.
If we wanted to just weight each observation’s squared difference from the mean by the
frequency of that observation, we would divide by n. So why use (n – 1) instead? The
answer is complicated, but here’s the simple version: In calculating the population
variance, we take into account every possible value and then weight that value by its true
frequency. But when we take a sample, you probably won’t get a sample that represents
every possible value including the most unlikely ones. For instance, if you take a sample
of heights, you probably won’t get a sample that includes men who are 8’4” or 3’6”, even
though such men exist in the population. As a result, your sample understates the amount
of variation in the population. Dividing by (n – 1) instead of n corrects for this
underestimation.
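This underestimation can be seen in a small simulation (not from the lecture; a sketch using repeated samples of die throws): averaging many samples' variance estimates, the n-divisor version systematically falls short of the true 35/12 ≈ 2.92, while the (n − 1) version does not.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

TRUE_VAR = 35 / 12  # population variance of a fair six-sided die, ≈ 2.92

biased, unbiased = [], []
for _ in range(20000):
    sample = [random.randint(1, 6) for _ in range(5)]  # sample of 5 throws
    xbar = sum(sample) / 5
    ss = sum((x - xbar) ** 2 for x in sample)
    biased.append(ss / 5)    # dividing by n
    unbiased.append(ss / 4)  # dividing by n - 1

avg_biased = sum(biased) / len(biased)        # ≈ 2.33: understates 2.92
avg_unbiased = sum(unbiased) / len(unbiased)  # ≈ 2.92

# The (n - 1) divisor lands much closer to the true variance on average
assert abs(avg_unbiased - TRUE_VAR) < abs(avg_biased - TRUE_VAR)
```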
Standard deviation: the square root of the variance (for both the sample and the
population).
s = √s²
σ = √σ²
V. Quartiles, Quintiles, Deciles, etc.
We use these measures when we want to divide a group or population into a number of
equal-sized subgroups. Quartiles are four equal-sized subgroups; quintiles are five equal-sized subgroups; deciles are ten equal-sized subgroups. (There are others, but these are
the most common.)
Note that “equal-sized” is with respect to the number of observations or members in each
subgroup, not the size of the group’s interval. For instance, households are often divided
into income quintiles. The top quintile includes the 20% of households that have the
highest annual income. The bottom quintile includes the 20% of households that have
the lowest annual income. These quintiles include equal numbers of households, but they
will not correspond to the same size intervals of incomes.
Percentiles or “Xiles” are used for various purposes, but most often in economics for
dividing the population into income groups. This can be useful for getting a sense of the
dispersion of incomes in the economy. But to see how they can be misleading, notice
that the dividing lines are much like the median: they can be invariant to changes on
either side of them. E.g., people at the top of the top quintile, or the bottom of the bottom
quintile, could get richer or poorer without affecting the quintile dividing lines.
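A sketch of both points in Python, using an invented list of household incomes. statistics.quantiles (Python 3.8+) returns the cut points between the subgroups, and making the richest household far richer moves no cut point:

```python
import statistics

# Hypothetical annual incomes (in thousands of dollars) for 20 households
incomes = [12, 18, 22, 25, 28, 31, 35, 38, 42, 45,
           48, 52, 57, 63, 70, 78, 90, 110, 150, 400]

# Four cut points dividing the households into five quintiles
cuts = statistics.quantiles(incomes, n=5)
print(cuts)  # [25.6, 39.6, 55.0, 87.6]

# Invariance: the top household getting much richer moves no dividing line
incomes[-1] = 4000
assert statistics.quantiles(incomes, n=5) == cuts
```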