Weights in Stata

Many IPUMS data projects have complex sampling designs, which means that not everyone has the same probability of being selected into the sample. For example, let’s consider an island with 1,000 dogs. We want to know about these dogs, so we decide to take a sample of 100. If we took a simple random 1-in-10 sample of the island’s dog population, each dog in the sample would represent 10 dogs.

Now, let’s say this island has 850 golden retrievers and only 150 miniature poodles. A simple 1-in-10 random sample would give us only around 15 miniature poodles. That’s not enough to get a good idea of the characteristics of the miniature poodle population. So instead, we sample 30 of the 150 miniature poodles (1 in 5) and 70 of the 850 golden retrievers (about 1 in every 12.14; but let’s be nice and say 1 in 12). Now every miniature poodle in the sample represents 5 miniature poodles in the population, and every golden retriever in the sample represents 12 golden retrievers in the population.

Now, let’s say we want to estimate the average height of dogs on the island. Because our data are weighted, we need to think carefully about exactly what we want from Stata. If we just ran sum height, we would get the average height in the sample (31.01 inches), but this is not the same as the average height in the population. The sample mean will be lower than the true population mean because all those extra miniature poodles in the sample are bringing it down. So we need to incorporate the weights somehow. How?

Stata has four main types of weights: fweight, aweight, pweight, and iweight. The clearest weight to use for most complex samples is pweight, which is the inverse of the probability of being selected into the sample (for a poodle this would be 5, and for a golden it would be 12). However, pweight is not available with every command. For example, you can’t use it with sum or tab, two of our favorite commands. Tragic! Actually, not tragic. Why?
Because we have aweights. Aweights get a kind of weird description in Stata’s documentation. It sounds like aweights should only be used when the data are averages, but they are much broader than that.

Let’s review some statistics! You are probably used to seeing things like this. We estimate the population average with the sample mean:

$$\bar{x} = \frac{\sum x_i}{n}$$

And we estimate the population standard deviation with the sample standard deviation:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

Then the standard deviation of $\bar{x}$, or the standard error, is

$$\frac{s}{\sqrt{n}}$$

But things get tricky when we have weights! First, how do we want to calculate that average? We want to weight each observation by the number of dogs it represents:

$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$

Remember that for each poodle $w_i$ is 5, and for each golden retriever it’s 12. This is quite intuitive. If a poodle is 16 inches tall, she will “count” for the 5 poodles she represents. A 36-inch-tall golden will count for the 12 golden retrievers she represents.

Now, what about the standard deviation? If we are weighting like a probability sample, we will want this:

$$s = \sqrt{\frac{n \sum w_i (x_i - \bar{x})^2}{(n - 1) \sum w_i}}$$

I want to draw your attention to an important idea here: the calculated mean would remain the same if we simply duplicated each poodle’s data 5 times (e.g., we used to have one poodle who was 16 inches tall and now we have 5), duplicated each golden’s data 12 times, and so on. However, the same is not true of the standard deviation. Duplicating the data based on the weight would produce the following estimate of the standard deviation, which ignores the actual sample size and treats the sum of the weights as if it were the number of observations:

$$s = \sqrt{\frac{\sum w_i (x_i - \bar{x})^2}{\sum w_i - 1}}$$

We want the first version of the standard deviation. How do we get that? With aweight. This is precisely how Stata treats the weight when it is used as an aweight. Stata treats an fweight like the second method, which is incorrect for sampling weights.

When to use fweights? You want to use fweights when you have collapsed the data.
Let’s say, for example, I took all the dogs who were the same height, counted them up, and put them on one row. The first dataset below has one row per dog; the second has one row per height:

Dognum   Height
1        16
2        16
3        16
4        30
5        36
6        36
7        38

Count    Height
3        16
1        30
2        36
1        38

With this second dataset, it’s correct to use fweight, with the count as the weight. This will “uncollapse” the data and treat it like the first dataset.

So, to summarize:

If you use fweight with tab or sum (or tab, sum()), Stata will give you the estimated population counts and will base standard deviations and SEs on this supposed larger sample. This is often not what someone actually wants, even for simple tabs. For example, if you want a table showing the descriptive statistics of your sample, you want the actual number of observations in your sample and the estimated population characteristics. Using fweight would inflate your sample and make it look like you have more data than you do. For example, if I were writing a paper about the dogs on the island, I would usually want this table:

Dog type       Percent
Poodle         87.8
Golden         12.2
Observations   100

This shows the number of actual observations and the weighted percentages (it uses aweight). If I used fweight, the table would show only the number calculated as being in the population. It would look like I have a complete-count dataset when I don’t:

Dog type       Percent
Poodle         87.8
Golden         12.2
Observations   1025

A good alternative is to use the table command with a pweight:

    table poodle [pw=weight], c(freq)

Aweight can be used in more situations than when the data are calculated means. Stata treats the aweight exactly like a probability weight in most commands. One exception is collapse: with pweight the counts total the sum of the weights, and with aweight they total the number of observations in the sample.

So I would say our advice should generally be: use pweight when you can. When you can’t, either there’s probably a better command or you can use aweight. The vast majority of users should not be using fweights or iweights.
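If you want to check the formulas above by hand, here is a minimal Python sketch (not Stata code) of the weighted mean and the two standard deviation formulas. It reuses the seven dog heights from the collapsed-data example. The sampling weights are an assumption made up for illustration: the three 16-inch dogs are treated as poodles (weight 5) and the other four as golden retrievers (weight 12). The function names are hypothetical too, and the code only mirrors the formulas as written here; it makes no claim about how Stata is implemented internally.

```python
import math
import statistics


def weighted_mean(x, w):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)


def sd_aweight(x, w):
    """SD using the real sample size n (the 'first version' in the text)."""
    n = len(x)
    xbar = weighted_mean(x, w)
    ss = sum(wi * (xi - xbar) ** 2 for xi, wi in zip(x, w))
    return math.sqrt(n * ss / ((n - 1) * sum(w)))


def sd_fweight(x, w):
    """SD that treats sum(w_i) as the number of observations."""
    xbar = weighted_mean(x, w)
    ss = sum(wi * (xi - xbar) ** 2 for xi, wi in zip(x, w))
    return math.sqrt(ss / (sum(w) - 1))


# One row per dog, as in the first dataset.
heights = [16, 16, 16, 30, 36, 36, 38]
# ASSUMPTION (illustrative only): 16-inch dogs are poodles, the rest goldens.
weights = [5, 5, 5, 12, 12, 12, 12]

# The same dogs collapsed to one row per height, as in the second dataset.
collapsed_heights = [16, 30, 36, 38]
counts = [3, 1, 2, 1]

# Physically duplicating each row w_i times, to compare with the fweight formula.
expanded = [xi for xi, wi in zip(heights, weights) for _ in range(wi)]

print("unweighted mean:        ", statistics.mean(heights))
print("weighted mean:          ", weighted_mean(heights, weights))
print("mean of duplicated rows:", statistics.mean(expanded))
print("SD, aweight formula:    ", sd_aweight(heights, weights))
print("SD, fweight formula:    ", sd_fweight(heights, weights))
print("SD of duplicated rows:  ", statistics.stdev(expanded))
print("SD, collapsed + counts: ", sd_fweight(collapsed_heights, counts))
print("SD, one row per dog:    ", statistics.stdev(heights))
```

Duplicating the rows leaves the mean unchanged and reproduces the fweight standard deviation, while the aweight version comes out a bit larger because it divides by the real sample size (7 dogs) rather than by the sum of the weights (63). The last two lines show the legitimate use of fweight: with the counts as weights, the collapsed dataset gives the same standard deviation as the one-row-per-dog dataset.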