Weights in Stata
Many IPUMS data projects have complex sampling designs, which means that not everyone has the same
probability of being selected into the sample.
For example, let’s consider an island where we have 1000 dogs. We want to know about these dogs, so
we decide to take a sample of 100 dogs. If we took a random 1/10 sample of the island’s dog population,
each dog in the sample would represent 10 dogs.
Now, let’s say this island has 850 golden retrievers and only 150 miniature poodles. If we do a simple
1/10 random sample, we’d only get around 15 miniature poodles. That’s not enough to get a good idea
of the characteristics of the miniature poodle population. So instead, we sample 30 out of 150 miniature
poodles (one in five) and 70/850 golden retrievers (about 1 in every 12.14 golden retrievers; but let’s be
nice and say 1 in 12). So now, every miniature poodle in the sample represents 5 miniature poodles in
the population and every golden retriever in the sample represents 12 golden retrievers in the
population.
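The weights in this example are simply the inverse of each breed's selection probability. A minimal Stata sketch of that arithmetic, using the population and sample counts from the text (the variable names poodle, n_pop, n_samp, and weight are made up for illustration):

```stata
* Population and sample counts from the example above
clear
input poodle n_pop n_samp
1 150 30
0 850 70
end

* Weight = population count / sample count = 1 / selection probability
generate weight = n_pop / n_samp
list
```

The poodle weight is exactly 5. The golden weight is about 12.14, which the text rounds to 12.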
Now, let’s say we want to estimate the average height of dogs on the island. Because our data are
weighted, we need to think carefully about exactly what we want from Stata.
If we just did
sum height
we would get the average height in the sample (31.01 inches), but this is not the same as the average
height in the population. The sample mean will be far lower than the true population mean because all
those extra miniature poodles in the sample are bringing it down.
So, we need to somehow incorporate those weights. How? Stata has four main types of weights. The
clearest weight to use for most complex samples is pweight, which is the inverse of the probability of
being selected into the sample (for a poodle, this would be 5 and for a golden this would be 12).
However, this weight is not always available with all commands. For example, you can’t use it with sum
or tab, two of our favorite commands. Tragic!
Actually, not tragic. Why? Because we have aweights. Aweights get a somewhat odd description in the
Stata documentation. It sounds like aweights should only be used when each observation represents an
average, but they are much more broadly useful than that.
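To see the difference in practice, here is a small sketch on a toy dataset. The four heights and the variable names are invented for illustration and are not the dog data behind the numbers in this handout; the weights are the 5 and 12 from the example.

```stata
* Hypothetical toy data: two poodles (weight 5) and two goldens (weight 12)
clear
input poodle height weight
1 16 5
1 14 5
0 36 12
0 34 12
end

* summarize does not accept pweights, so this returns a nonzero return code
capture noisily summarize height [pweight=weight]
local rc_pw = _rc
display "return code with pweight: `rc_pw'"

* aweights are accepted and give the weighted mean
summarize height [aweight=weight]
```

The capture prefix keeps the do-file running after the error so that the second command still executes.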
Let’s review some statistics!
You are probably used to seeing things like this:
We estimate the population average with the sample mean:
$$\bar{x} = \frac{\sum x_i}{n}$$
And we estimate the population standard deviation with the sample standard deviation:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
Then the standard deviation of $\bar{x}$, or the standard error, is
$$\frac{s}{\sqrt{n}}$$
But things get tricky when we have weights!
First, how do we want to calculate that average? We want to weight each observation by the number of
dogs it represents:
$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$
Remember that for each poodle, 𝑤𝑖 is 5 and for each golden retriever it’s 12. This is quite intuitive. If a
poodle is 16 inches tall, she will “count” for the 5 poodles she represents. A 36 inch tall golden will count
for the 12 golden retrievers she represents.
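As a quick check of the weighted-mean formula, the sketch below computes it by hand for the two dogs just mentioned (a 16 inch poodle with weight 5 and a 36 inch golden with weight 12) and compares the result with what summarize reports for an aweight. The variable and scalar names are made up for illustration.

```stata
* The two dogs from the paragraph above
clear
input height weight
16 5
36 12
end

* Numerator: sum of weight * height; denominator: sum of weights
generate wx = weight * height
quietly summarize wx
scalar num = r(sum)
quietly summarize weight
scalar den = r(sum)
display num / den      // (5*16 + 12*36) / (5 + 12)

* The same weighted mean from summarize with an aweight
summarize height [aweight=weight]
display r(mean)
```

Both displays should show 512/17, or about 30.12 inches.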
Now, what about the standard deviation? If we are weighting like a probability sample, we will want
this:
$$s = \sqrt{\frac{n \sum w_i (x_i - \bar{x})^2}{(n - 1) \sum w_i}}$$
I want to draw your attention to an important idea here: the calculated mean would remain the same if
we simply duplicated each poodle’s data 5 times (e.g., we used to have one poodle who was 16 inches tall
and now we have 5), duplicated each golden’s data 12 times, and so on. However, this is not true of
the standard deviation. Duplicating the data based on the weight would produce the following
estimate of the standard deviation, which completely ignores the actual sample size:
$$s = \sqrt{\frac{\sum w_i (x_i - \bar{x})^2}{\sum w_i - 1}}$$
We want the first version of the standard deviation. How do we get that? With aweight: this is
precisely how Stata treats the weight when it is used as an aweight. Stata treats an fweight like the
second (incorrect) method.
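The two formulas can be compared directly in Stata. This sketch uses an invented four-dog dataset (not the data behind this handout), asks summarize for the standard deviation under each weight type, and then recomputes both formulas by hand. All variable and scalar names are assumptions for illustration.

```stata
* Hypothetical toy data: two poodles (weight 5) and two goldens (weight 12)
clear
input height weight
16 5
14 5
36 12
34 12
end

* Standard deviation as reported under each weight type
quietly summarize height [aweight=weight]
scalar xbar  = r(mean)
scalar sd_aw = r(sd)
quietly summarize height [fweight=weight]
scalar sd_fw = r(sd)

* Weighted sum of squared deviations and the sum of the weights
generate double wdev2 = weight * (height - xbar)^2
quietly summarize wdev2
scalar ss = r(sum)
quietly summarize weight
scalar sw = r(sum)

* First formula (uses the actual sample size n = _N)
display sqrt(_N * ss / ((_N - 1) * sw)) "  vs aweight sd: " sd_aw
* Second formula (acts as if there were sum-of-weights observations)
display sqrt(ss / (sw - 1)) "  vs fweight sd: " sd_fw
```

With only four real observations, the fweight version behaves as if there were 34 dogs in the sample, which is why its standard deviation differs.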
When to use fweights?
You want to use fweights when you have collapsed the data. Let’s say, for example, I took all the dogs
who were the same height, counted them up, and put them on one row:

Dognum  Height
1       16
2       16
3       16
4       30
5       36
6       36
7       38

Count   Height
3       16
1       30
2       36
1       38

With this second dataset, it’s correct to use fweight, with the count as the weight. This will “uncollapse”
the data and treat it like the first data set.
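The equivalence can be checked with the collapsed table above. This sketch summarizes the collapsed rows with an fweight, then uses expand to rebuild the seven-row dataset and summarizes it without weights. The variable name count follows the table header; the scalar names are made up.

```stata
* The collapsed dataset from the table above
clear
input height count
16 3
30 1
36 2
38 1
end

* fweight treats each row as "count" identical observations
summarize height [fweight=count]
scalar n_fw    = r(N)
scalar mean_fw = r(mean)
scalar sd_fw   = r(sd)

* Rebuild the uncollapsed data: each row becomes "count" copies
expand count
summarize height
```

Both calls to summarize should report 7 observations with the same mean and standard deviation.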
So, to summarize:
If you use fweight with tab or sum (or tab, sum()), it will give you the estimated population counts and
will base standard deviations and SEs on this supposed larger sample.
This is often not what someone actually wants, even for simple tabs. For example, if you want a
table showing the descriptive stats of your sample, you want the actual number of obs in your sample
and the estimated population characteristics. Using fweight would inflate your sample and make it
look like you have more data than you do. For example, if I were writing a paper about the dogs on the
island, I would often want this table:
Dog type        Percent
Poodle          87.8
Golden          12.2
Observations    100
This shows the number of actual observations and the weighted percentages (uses aweight).
If I used fweight, the table would show only the number of dogs Stata has calculated as being in the
population. It would look like I have a complete count dataset, when I don’t:
Dog type        Percent
Poodle          87.8
Golden          12.2
Observations    1025
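The two tables above come from the same tab command with different weight types. A minimal sketch on an invented four-dog dataset (not the data behind the percentages shown here), with made-up variable names:

```stata
* Hypothetical toy data: two poodles (weight 5) and two goldens (weight 12)
clear
input poodle weight
1 5
1 5
0 12
0 12
end

* aweight: weighted percentages, with the total kept at the sample size
tabulate poodle [aweight=weight]

* fweight: same percentages, but the total is the sum of the weights
tabulate poodle [fweight=weight]
```

In this toy data the percentages are identical in both tables, while the fweight total is 34 rather than the 4 dogs actually observed.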
A good alternative to this is to use the table command with a pweight.
table poodle [pw=weight], c(freq)
Aweights can be used in more situations than when the weight comes from a calculated mean. Stata treats
an aweight like a probability weight in most commands. One exception is collapse: with pweight the
counts total the sum of the weights, and with aweight they total the number of obs in the sample.
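The collapse exception can be seen with a short sketch. The dataset below is invented for illustration, and n is a made-up name for the count variable.

```stata
* Hypothetical toy data: two poodles (weight 5) and two goldens (weight 12)
clear
input poodle height weight
1 16 5
1 14 5
0 36 12
0 34 12
end

* pweight: counts are the sum of the weights within each group
preserve
collapse (count) n=height [pweight=weight], by(poodle)
list
restore

* aweight: counts are the number of sample observations within each group
collapse (count) n=height [aweight=weight], by(poodle)
list
```

With pweight the listed counts are 24 goldens and 10 poodles; with aweight they are 2 and 2.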
So I would say our advice should generally be: use pweight when you can. When you can’t, either
there’s probably a better command or you can use aweight. The vast majority of users should not be
using fweights or iweights.