March 2013 Lecture: Missing Data Part 1 Follow-up
Additional examples of Missingness Mechanisms – Follow up to SON Brown Bag Presentation – 3/20/13 (C Thompson) – Missing Data part 1

From Baraldi/Enders 2009 reference pp. 7-9:

Theoretical background: Rubin's missing data mechanisms

Before we can begin discussing different missing data handling options, it is important to have a solid understanding of so-called “missing data mechanisms”. Rubin (1976) and colleagues (Little & Rubin, 2002) came up with the classification system that is in use today: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). These mechanisms describe relationships between measured variables and the probability of missing data. While these terms have a precise probabilistic and mathematical meaning, they are essentially three different explanations for why the data are missing. From a practical perspective, the mechanisms are assumptions that dictate the performance of different missing data techniques. We give a conceptual description of each mechanism in this section, and supplementary resources are available to readers who want additional details on the missing data mechanisms (Allison, 2002; Enders, 2010; Little & Rubin, 2002; Rubin, 1976; Schafer & Graham, 2002).

To begin, data are MCAR when the probability of missing data on a variable X is unrelated to other measured variables and to the values of X itself. In other words, missingness is completely unsystematic and the observed data can be thought of as a random subsample of the hypothetically complete data. As an example, consider a child in an educational study that moves to another district midway through the study. The missing values are MCAR if the reason for the move is unrelated to other variables in the data set (e.g., socioeconomic status, disciplinary problems, or other study-related variables).
Other examples of MCAR occur when a participant misses a survey administration due to scheduling difficulties or other unrelated reasons (such as a doctor's appointment), a computer randomly misreads grid-in sheets, or an administrative blunder causes several test results to be misplaced prior to data entry. MCAR data may also be a purposeful byproduct of the research design. For example, suppose that a researcher collects self-report data from the entire sample but limits time-consuming behavioral observations to a random subset of participants. We describe a number of these so-called planned missing data designs at the end of the paper. Because MCAR requires missingness to be unrelated to study variables, methodologists often argue that it is a very strict assumption that is unlikely to be satisfied in practice (Raghunathan, 2004; Muthen, Kaplan, & Hollis, 1987).

The MAR mechanism requires a less stringent assumption about the reason for missing data. Data are MAR if missingness is related to other measured variables in the analysis model, but not to the underlying values of the incomplete variable (i.e., the hypothetical values that would have resulted had the data been complete). This terminology is often confusing and misleading because of the use of the word “random.” In fact, an MAR mechanism is not random at all and describes systematic missingness where the propensity for missing data is correlated with other study-related variables in an analysis. As an example of an MAR mechanism, consider a study that is interested in assessing the relationship between substance use and self-esteem in high school students. Frequent substance abuse may be associated with chronic absenteeism, leading to a higher probability of missing data on the self-esteem measure (e.g., because students tend to be absent on the days that the researchers administered the self-esteem questionnaires).
This example qualifies as MAR if the propensity for missing data on the self-esteem measure is completely determined by a student's substance use score (i.e., there is no residual relationship between the probability of missing data and self-esteem after controlling for substance use). As a second example, suppose that a school district administers a math aptitude exam, and students that score above a certain cut-off participate in an advanced math course. The math course grades are MAR because missingness is completely determined by scores on the aptitude test (e.g., students that score below the cut-off do not have a grade for the advanced math course).

C:\Documents and Settings\mdenny1\Local Settings\Temporary Internet Files\Content.Outlook\KZU6WGSS\Examples_Missingness Mechanisms_20130325.doc 3/25/2013 2:38 PM p 1 of 3

Finally, data are MNAR if the probability of missing data is systematically related to the hypothetical values that are missing. In other words, the MNAR mechanism describes data [...] course grades). Although the magnitude of the bias depends on the correlation between the omitted aptitude variable and the course grades (bias increases as the correlation increases), the analysis is nevertheless consistent with an MNAR mechanism. Later in the manuscript, we describe methods for incorporating so-called auxiliary variables that are related to missingness into a statistical analysis. Doing so can mitigate bias (i.e., by making the MAR mechanism more plausible) and can improve power (i.e., by recapturing some of the missing information).

From Howell ref (Missing):

1.1 The nature of missing data

Missing completely at random

There are several reasons why data may be missing. They may be missing because equipment malfunctioned, the weather was terrible, people got sick, or the data were not entered correctly. Here the data are missing completely at random (MCAR).
When we say that data are missing completely at random, we mean that the probability that an observation (Xi) is missing is unrelated to the value of Xi or to the value of any other variables. Thus data on family income would not be considered MCAR if people with low incomes were less likely to report their family income than people with higher incomes. Similarly, if Whites were more likely to omit reporting income than African Americans, we again would not have data that were MCAR because missingness would be correlated with ethnicity. However, if a participant's data were missing because he was stopped for a traffic violation and missed the data collection session, his data would presumably be missing completely at random. Another way to think of MCAR is to note that in that case any piece of data is just as likely to be missing as any other piece of data.

Notice that it is the value of the observation, and not its "missingness," that is important. If people who refused to report personal income were also likely to refuse to report family income, the data could still be considered MCAR, so long as neither of these had any relation to the income value itself. This is an important consideration, because when a data set consists of responses to several survey instruments, someone who did not complete the Beck Depression Inventory would be missing all BDI subscores, but that would not affect whether the data can be classed as MCAR.

The nice feature of data that are MCAR is that the analysis remains unbiased. We may lose power for our design, but the estimated parameters are not biased by the absence of data.

Missing at random

Often data are not missing completely at random, but they may be classifiable as missing at random (MAR).
(MAR is not really a good name for this condition because most people would take it to be synonymous with MCAR, which it is not. However, the label has stuck.) Let's back up one step. For data to be missing completely at random, the probability that Xi is missing is unrelated to the value of Xi or other variables in the analysis. But the data can be considered as missing at random if the data meet the requirement that missingness does not depend on the value of Xi after controlling for another variable. For example, people who are depressed might be less inclined to report their income, and thus reported income will be related to depression. Depressed people might also have a lower income in general, and thus when we have a high rate of missing data among depressed individuals, the existing mean income might be lower than it would be without missing data. However, if, within depressed patients, the probability of reported income was unrelated to income level, then the data would be considered MAR, though not MCAR. Another way of saying this is to say that to the extent that missingness is correlated with other variables that are included in the analysis, the data are MAR.

The phraseology is a bit awkward here because we tend to think of randomness as not producing bias, and thus might well think that Missing at Random is not a problem. Unfortunately it is a problem, although in this case we have ways of dealing with the issue so as to produce meaningful and relatively unbiased estimates. But just because a variable is MAR does not mean that you can just forget about the problem. Nor does it mean that you have to throw up your hands and declare that there is nothing to be done.

The situation in which the data are at least MAR is sometimes referred to as ignorable missingness.
This name comes about because for those data we can still produce unbiased parameter estimates without needing to provide a model to explain missingness. Cases of MNAR, to be considered next, could be labeled cases of nonignorable missingness.

Missing Not at Random

If data are not MCAR or MAR then they are classed as Missing Not at Random (MNAR). For example, if we are studying mental health and people who have been diagnosed as depressed are less likely than others to report their mental status, the data are not missing at random. Clearly the mean mental status score for the available data will not be an unbiased estimate of the mean that we would have obtained with complete data. The same thing happens when people with low income are less likely to report their income on a data collection form.

When we have data that are MNAR we have a problem. The only way to obtain an unbiased estimate of parameters is to model missingness. In other words, we would need to write a model that accounts for the missing data. That model could then be incorporated into a more complex model for estimating missing values. This is not a task anyone would take on lightly. See Dunning and Freedman (2008) for an example. However, even if the data are MNAR, all is not lost. Our estimators may be biased, but the bias may be small.

See missingdata.org.uk reference: Introduction to missing data, Missingness mechanisms
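
The three mechanisms described in the excerpts above can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from any of the cited references. It assumes a fully observed variable `x` (playing a role similar to substance use or the aptitude score) and an incomplete variable `y` (similar to self-esteem or the course grade) that is correlated with `x`. The sample size, the 0.7 slope, the response probabilities, and the function names `mean_observed`, `stratified_mean`, and `simulate` are all hypothetical choices made for the example.

```python
import random
import statistics


def mean_observed(values, observed):
    """Mean of the values that were actually observed (complete-case mean)."""
    return statistics.fmean([v for v, keep in zip(values, observed) if keep])


def stratified_mean(y, observed, stratum):
    """Mean of y that 'controls for' a fully observed grouping variable.

    Each stratum's observed mean is weighted by the stratum's share of the
    full sample, which is known because the grouping variable has no
    missing values.
    """
    n = len(y)
    total = 0.0
    for s in set(stratum):
        members = [i for i in range(n) if stratum[i] == s]
        seen = [y[i] for i in members if observed[i]]
        total += (len(members) / n) * statistics.fmean(seen)
    return total


def simulate(n=20000, seed=0):
    """Delete y values under MCAR, MAR and MNAR and compare the means."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]        # fully observed
    y = [0.7 * xi + rng.gauss(0.0, 1.0) for xi in x]   # subject to missingness
    high_x = [xi > 0 for xi in x]                      # observed grouping variable

    # True means "y is observed". All probabilities are assumed values.
    mcar = [rng.random() < 0.7 for _ in range(n)]                  # ignores x and y
    mar = [rng.random() < (0.4 if hx else 0.9) for hx in high_x]   # depends on x only
    mnar = [rng.random() < (0.4 if yi > 0 else 0.9) for yi in y]   # depends on y itself

    results = {"complete": statistics.fmean(y)}
    for name, obs in (("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)):
        results[name + " naive"] = mean_observed(y, obs)
        results[name + " stratified"] = stratified_mean(y, obs, high_x)
    return results


if __name__ == "__main__":
    for label, value in simulate().items():
        print(f"{label:>16}: {value:+.3f}")
```

Under these assumptions, the MCAR complete-case mean should stay close to the complete-data mean, matching the claim that MCAR costs power but not accuracy. The naive MAR mean should be noticeably off, while the stratified MAR mean should recover the complete-data value, because missingness is unrelated to `y` once `x` is controlled for. Under MNAR, neither estimate should recover it, since missingness depends on the unobserved values themselves.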