Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Statistical methods for knowledge discovery in adverse drug reaction surveillance Statistical methods for knowledge discovery in adverse drug reaction surveillance G. Niklas Norén Stockholm University c G. Niklas Norén, Stockholm 2007 Cover photography by G. Niklas Norén ISBN 91-7155-411-4 pp. 1–41 Typeset by LATEX Printed in Sweden by Universitetsservice AB, Stockholm 2007 Distributor: Department of Mathematics, Stockholm University Abstract Collections of individual case safety reports are the main resource for early discovery of unknown adverse reactions to drugs once they have been introduced to the general public. The data sets involved are complex and based on voluntary submission of reports, but contain pieces of very important information. The aim of this thesis is to propose computationally feasible statistical methods for large-scale knowledge discovery in these data sets. The main contributions are a duplicate detection method that can reliably identify pairs of unexpectedly similar reports and a new measure for highlighting suspected drug–drug interaction. Specifically, we extend the hit-miss model for database record matching with a hit-miss mixture model for scoring numerical record fields and a new method to compensate for strong record field correlations. The extended hit-miss model is implemented for the WHO database and demonstrated to be useful in real world duplicate detection, despite the noisy and incomplete information on individual case safety reports. The Information Component measure of disproportionality has been in routine use since 1998 to screen the WHO database for excessive adverse drug reaction reporting rates. Here, it is further refined. We introduce improved credibility intervals for rare events, post-stratification adjustment for suspected confounders and an extension to higher order associations that allows for simple but robust screening for potential risk factors. A new approach to identifying reporting patterns indicative of drug–drug interaction is also proposed. Finally, we describe how imprecision estimates specific to each prediction of a Bayes classifier may be obtained with the Bayesian bootstrap. Such case-based imprecision estimates allow for better prediction when different types of errors have different associated loss, with a possible application in combining quantitative and clinical filters to highlight drug–ADR pairs for clinical review. List of Papers This thesis is based on the following original publications, which are referred to in the text by their Roman numerals. I II III IV V Norén, G. N., Orre, R., Bate, A., Edwards, I. R. (2007). Duplicate detection in adverse drug reaction surveillance. Data Mining and Knowledge Discovery. Published on-line. Norén, G. N., Bate, A., Orre, R., Edwards, I. R. (2006). Extending the methods used to screen the WHO drug safety database towards analysis of complex associations and improved accuracy for rare events. Statistics in Medicine, 25(21):3740–3757. Hopstadius, J, Norén, G. N., Bate, A., Edwards, I. R. (2007). Adjustment for potential confounders in adverse drug reaction surveillance. Submitted for publication. Norén, G. N., Sundberg, R., Bate, A., Edwards, I. R. (2007). A statistical methodology for drug–drug interaction surveillance. Submitted for publication. Norén, G. N., Orre, R. (2005). Case based imprecision estimates for Bayes classifiers with the Bayesian bootstrap. Machine Learning, 58(1):79–94. Reprints of I, II and V were made with kind permission from the publishers. Contents Part I: Thesis summary 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Adverse drug reaction surveillance ......................... 2.1 Individual case safety reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 The WHO database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Adverse drug reaction signal detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Knowledge discovery in adverse drug reaction surveillance . . . . . . . 3.1 3.2 3.3 3.4 3.5 3.6 3.7 Context . . . . . . . . . . . . . . . . . Process . . . . . . . . . . . . . . . . . Disproportionality . . . . . . . . . . Shrinkage . . . . . . . . . . . . . . . Pattern discovery and detection . Facilitating interpretation . . . . . Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Paper I . . Paper II . Paper III . Paper IV Paper V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 6 8 11 . . . . . . . 4 Overview of the papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 4.2 4.3 4.4 4.5 1 3 12 13 15 17 19 20 22 25 . . . . . 25 28 30 31 32 35 37 Part I: Thesis summary 1. Introduction It is in the nature of pharmaceutical development that the full safety profile of a new medicinal product will not be known at the time it is introduced to the general public. Because randomised clinical trials are limited in both the types and numbers of patients exposed, continued safety monitoring of drugs is in the interest of patients, regulatory authorities and pharmaceutical companies (Finney 1966, Evans 2000). Individual case safety reports are submitted by health professionals based on suspected adverse drug reaction (ADR) incidents (Edwards and Aronson 2000) observed in real world clinical practice. They remain one of the best resources for early post-marketing discovery of potential public health or patient safety issues. They are rich sources of information, but anecdotal in nature. The reliance on voluntary submission, the variation in quality of information and the large number of new reports submitted to national and international organisations every year provide a range of interesting statistical challenges. 1.1 Aim The overall aim of this thesis is to propose improved statistical methods for knowledge discovery in collections of individual case safety reports. I proposes a new method for automated duplicate detection based on the hit-miss model introduced for statistical record linkage (matching records across data sets) by Copas and Hilton (1990). An extended hit-miss model that handles numerical record fields and compensates for correlations between record fields is implemented for the WHO database and demonstrated to be useful in real world duplicate detection. II proposes improved credibility intervals, a poststratification approach to adjustment for confounding variables and an extension to higher order associations for the Information Component (IC) measure of disproportionality used to screen the WHO database for excessive ADR relative reporting rates. III demonstrates that the post-stratification adjustment of the observed-to-expected ratio for suspected confounders adopted for the IC in II may lead to spurious underestimation in the presence of any very small strata in a stratified data set. A comparison to a literature reference indicates that while routine adjustment for some potential confounders in first pass screening of collections of individual case safety reports does improve 1 performance, the magnitude of this improvement is modest compared to the improvement from a triage (prioritisation) criterion requiring reports from at least two countries before a drug–ADR pair is highlighted for clinical review. This suggests that confounding may have less impact on the analysis of individual case safety reports than previously believed. IV introduces a new measure of drug–drug interaction for collections of individual case safety reports. Unlike methods proposed previously for this purpose, it defines interaction as departure from a baseline model with independent attributable risk. V introduces a Bayesian bootstrap method for estimating the uncertainty in Bayes classification associated with each individual prediction. We demonstrate how this information can be used to improve performance, when different types of errors have different associated loss, with a possible application in selecting drug–ADR pairs for detailed clinical review. 2 2. Adverse drug reaction surveillance The analysis of individual case safety reports is the cornerstone of early postmarketing ADR detection (Rawlins 1988). Whereas large, formal drug safety studies are useful to test specific hypotheses related to drug safety, they are not suitable for continuous monitoring with the aim of detecting previously unsuspected ADRs, as early as possible. In the context of this PhD thesis, ADR surveillance refers exclusively to drug safety monitoring based on individual case safety reports. It thus excludes other post-marketing efforts such as the intensive monitoring programs of New Zealand and the United Kingdom, as well as safety monitoring based on health registries and hospital-based safety monitoring. For comprehensive overviews of post-marketing ADR surveillance, see Lindquist (2003) and Bate (2003). 2.1 Individual case safety reports Individual case safety reports communicate genuine clinical concerns from observant health professionals (Edwards 1999). As they are based on actual patients in real world clinical practice, their collection and analysis increase the chance to discover ADRs that are due to drug–drug interaction, affect patients with certain medical predispositions or that belong to patient subgroups that tend to be excluded from pre-marketing clinical trials, such as children or pregnant women. In addition, the large numbers of patients exposed and the unlimited follow-up time available considerably increase the chance to detect ADRs that are rare or that occur only after extended periods of use. An example of an authentic individual case safety report is provided in Figure 2.1. Much of the information on these reports can be originally provided as free text, some of which is later encoded as structured information upon database entry. This is usually done by trained personnel at pharmaceutical companies or at national authorities. The encoding of observed ADR incidents in terms of standardised terminology is a critical part of the preprocessing. One potential pitfall is the risk of misinterpretation when the ADR encoding is performed by someone who has never actually met the patient. Variation in coding across regions and time periods may lead to systematic differences that can affect subsequent data analysis. A general problem is that several ADR terms are often applicable to a given incident. Thus, exploratory analysis 3 Figure 2.1: Sample individual case safety report. Reprinted with kind permission of the Adverse Drug Reactions Unit at the Therapeutic Goods Administration of Australia 4 focusing on single ADR terms may fail to include all relevant reports — a phenomenon which has been referred to as ‘signal fragmentation’ (Purcell 2003). While in the follow-up of specific issues, this can be remedied by specifying groups of relevant ADR terms for the issue of interest, it is not obvious how such strategies can be easily automated for routine exploratory analysis. Individual case safety reports refer to suspected ADR incidents and some adverse events observed in association with drug prescription will in reality be coincidental, due to concomitant medication or natural progression of the underlying disease. At the same time, not all ADR incidents that actually occur are identified as such and eventually reported to the national drug safety centres. The degree of under-reporting is unknown but can be expected to vary with the severity of the suspected ADR, across geographical regions and time periods. There may also be variation in the propensity to report suspected ADRs during the life-span of a drug and in response to any attention to suspected drug safety issues in the public or scientific media. The categories of health professionals who are allowed to submit reports also differ over time and between regions. Some countries allow only medical doctors to submit reports, whereas others accept reports from medical nurses and pharmacists as well. In addition, some countries encourage direct consumer reporting. Unsurprisingly, the propensity to report suspected ADRs of different types varies considerably between different categories of reporters (Savage 1985). An important characteristic of individual case safety report submission is that separate reports sometimes have a common origin and therefore cannot be considered as independent pieces of information (Finney 1973). This may distort automated knowledge discovery and mislead clinical review. The most obvious cause of non-independent reports is report duplication, where a single suspected ADR incident results in several reports. This phenomenon is discussed at some length in I. More subtle examples include groups of reports provided by the same health professional, such as those from the Norwegian dentist discovered in I, reports from the same clinical study (sometimes mislabelled as spontaneous reports) or separate reports for the same patient at different points in time. If single individuals are responsible for encoding large numbers of reports, this may also induce superficial similarity between reports. A potential example of this is the group of over 600 very similar reports originally collected by a single law firm, discovered in IV. Violated independence assumptions differ from other data quality issues in that they do not relate to the quality of single reports, but to the quality of collections of reports. Even upon the confirmation that a pair of reports are indeed duplicates it is not obvious how to proceed: should the suspected duplicates be flagged or should one of them perhaps be removed from the data set (if so, which one)? 5 Drugs ADRs Anatomical Therapeutic Class Reports System Organ Class Selective serotonin reuptake inhibitors (SSRI) 193,939 Body as a whole - general disorders 1,218,425 Antiinfl. prep. nonsteroids for topical use (NSAID) 180,770 Skin and appendages disorders 1,070,189 Platelet aggregation inhibitors excl. heparin 179,226 Gastro-intestinal system disorders 902,238 ACE inhibitors, plain 171,706 Central & peripheral nervous system disorders 853,883 Benzodiazepine tives 157,898 Psychiatric disorders 677,227 deriva- Reports Table 2.1: The most commonly reported groups of drugs and ADRs in the WHO database (note that each report may list more than one drug and more than one ADR). 2.2 The WHO database The Uppsala Monitoring Centre maintains and analyses the world’s largest collection of individual case safety reports. As of December 2006, the WHO database contained over 3.8 million reports, with a current yearly growth of over 200,000 reports (see Figure 2.2). The database is held on behalf of the countries participating in the WHO Programme for International Drug Monitoring, whose number has continued to grow from the founding 10 countries in 1968 to over 80 member countries at the end of 2006. The international coverage allows rare but important public health or patient safety issues to be detected earlier after drug launch than if based on isolated analysis of national data sets (Olsson 1998). Variation between countries in the range of available drug substances, populations at risk, reporting culture and regulation may influence relative reporting rates and make knowledge discovery more complicated in international data sets. At the same time, this diversity is an invaluable asset in detecting public health or patient safety issues related to for example ethnic or dietary ADR risk factors. Thus, even though most reports in the WHO database come from the USA and other industrialised nations, the worldwide coverage of the WHO programme is perhaps its greatest strength. As is clear from Figure 2.2, the vast majority of reports in the WHO database are so-called spontaneous reports that refer to observations in regular clinical practice. However, a small minority are from intensive monitoring programs or clinical studies. Such atypical reports should in principle be labelled as 6 Number of reports 4,000,000 3,000,000 2,000,000 1,000,000 0 1970 1975 1980 1985 1990 1995 2000 2005 Year a. Database growth United States United Kingdom Germany Canada France Australia Other countries 20% 5% Spontaneous reports Other 5% 5% 5% 47% 95% 6% 12% b. Biggest contributors c. Types of reports 2,000,000 Number of reports Number of reports 4,000,000 3,000,000 2,000,000 1,000,000 0 0 2 4 6 8 Number of drugs per report 1,500,000 1,000,000 500,000 0 10 d. Number of drugs per report 0 2 4 6 8 Number of ADRs per report 10 e. Number of ADRs per report Number of reports 50,000 40,000 30,000 20,000 10,000 0 0 10 20 30 40 50 60 Patient age (years) 70 80 90 100 110 f. Patient age distribution Figure 2.2: Characteristics of the WHO database 7 such, but occasional mislabellings do occur. Thus, they cannot reliably be excluded from the analysis. Table 2.1 indicates what groups of drugs and ADRs have been reported most often during the entire life span of the WHO database. From Figure 2.2, it is clear that most reports list only one suspected drug and between one and four ADRs, but there are reports that deviate from this general pattern, and list very large numbers of drugs and ADRs. The most striking aspect of the empirical age distribution in Figure 2.2 is perhaps the large number of reports for children less than two years of age. A large proportion of these relate to suspected adverse reactions to vaccines. Another interesting phenomenon is the digit preference on 0 and 5 for encoding patient age. 2.3 Adverse drug reaction signal detection The detection of early warnings related to potential public health or patient safety issues is the main aim of collecting and analysing individual case safety reports. In the context of ADR surveillance, the WHO defines a signal as: "Reported information on a possible causal relationship between an adverse event and a drug, the relationship being unknown or incompletely documented previously. Usually more than a single report is required to generate a signal, depending upon the seriousness of the event and quality of the information." (Edwards and Biriell 1994) As is clear from the definition, single reports in isolation rarely motivate the communication of an early warning of a potential ADR, but there are exceptional examples where single reports of very high quality do (Meyboom et al. 1997). Particularly valuable pieces of information in this respect are those that indicate the effect on the ADR of withdrawing the suspected medication (socalled dechallenge intervention), and the effect of re-exposing the patient to the suspected treatment, after a successful dechallenge (so-called rechallenge intervention) (Edwards et al. 1990). Moreover, Aronson and Hauben (2006) argue that there are certain types of ADRs for which single, well documented incidents may motivate early warning, much in the spirit of the triage algorithms proposed by Ståhl et al. (2004). Early warning of a potential ADR is possible even in the absence of any individually very strong reports, if there is a large enough number of reports on the drug–ADR pair of interest (Edwards et al. 1990). This is true in particular when alternative systematic explanations to excessive reporting rates, such as reporting biases or strong confounding, can be dismissed and the relative 8 Figure 2.3: Signal detection process reporting rate remains excessive even after suspected duplicates have been removed. The aim of ADR signal detection is to generate, strengthen and refine hypotheses related to suspected drug toxicity. Hypothesis testing is not possible on account of the inherently non-systematic nature of data collection and the lack of proper comparison groups. In-depth clinical evaluation and scrutiny of reports remain at the core of the ADR signal detection process. However, the WHO database receives tens of thousands of reports every month and this massive inflow of reports require efficient computational methods to help clinical experts focus on the groups of reports most likely to represent important public health or patient safety issues (Meyboom et al. 2002). As indicated in Figure 2.3, the signal detection process in routine use on the WHO database consists of a combination of automated knowledge discovery methods (Bate et al. 1998), triage (prioritisation) algorithms and clinical review (Ståhl et al. 2004). The knowledge discovery methods highlight drug– ADR pairs with unexpectedly large numbers of reports relative to the average reporting rates in the database. Triage algorithms use a combination of quantitative and qualitative information to focus attention on the most urgent issues for follow-up (Ståhl et al. 2004). Reports related to drug–ADR pairs singled out by the triage algorithm are forwarded to a panel of international experts for clinical review. In the context of the clinical review, pattern discovery methods may often be useful to profile larger groups of reports and suggest alternative explanations to observed excessive reporting rates. Hypotheses of suspected ADRs first highlighted in automated knowledge discovery that remain after clinical review are routinely communicated to the drug safety community, and some have been published in the mainstream medical literature (Coulter 9 et al. 2001, Sanz et al. 2005). However, the risk of distortion from undiscovered data quality problems and the difficulty of obtaining complete, detailed information on reported ADR incidents mean that signals of suspected ADRs often remain tentative, even after clinical review. 10 3. Knowledge discovery in adverse drug reaction surveillance Vast improvements in data storage capacity over the last decades have spurred ever increasing ambitions to analyse large, complex data sets not originally collected for the purpose of statistical analysis. Such investigations require data analysis methodology that scales well with increasing amounts of data and that focuses on discovery and exploration rather than on inference. This area of research and application, on the border between mathematical statistics and computer science, is referred to as knowledge discovery or data mining. Fayyad et al. (1996) describe data mining as one step in a more general knowledge discovery process. Mannila (1996) and Hand (1998) emphasise the similarity between data mining and exploratory statistical analysis, the latter characterising the difference as one primarily related to data set size and properties: in data mining, contamination, nonstationarity and biases are standard. On account of the complex data sets involved, interpretability is often a main consideration, which may favour simplicity at the expense of prediction accuracy (Glymour et al. 1997). An important dividing line is the choice between model based inference and algorithmic approaches (Breiman 2001). Whereas much of the research on knowledge discovery has been driven by computer science, key contributions from the statistical community include the clarification of inferential processes underlying algorithmic methods, insight into the bias–variance trade-off in determining model complexity, methods for quantifying uncertainty and placing emphasis of the impact on interpretation of potential distortions such as confounding or selection biases (Elder and Pregibon 1996, Glymour et al. 1997, Efron 2001). In contrast with the more rigid framework for hypothesis testing, knowledge discovery is usually an interactive and iterative process of increasingly refined hypothesis generation. In my view, it should combine an unintimidated attitude towards the analysis of problematic and complex data sets with a proper understanding and clear statement of the limitations in nature and strength of the conclusions that can be drawn. 11 3.1 Context Collections of individual case safety reports clearly contain important pieces of rich and very useful information (Finney 1973, Edwards 1997), but they constitute an inherently non-random sample. The presence of reporting biases and violated independence assumptions discussed in Section 2 render summary statistics potentially deceptive. In particular, the presence of nonindependent reports can lead to optimistic precision estimates and invalidate standard tests for association (Finney 1971). As a consequence, the place for statistical methodology in the analysis of collections of individual case safety reports is somewhat out of the ordinary. Its main focus is on providing a framework for effective hypothesis generation and refinement, rather than on hypothesis testing (Bate 2003). Methods for reliably identifying elevated ADR reporting rates in collections of individual case safety reports are already part of routine drug safety signal detection (Ståhl et al. 2004). In the future, methods for highlighting suspected drug–drug interaction, groups of nonindependent reports or reporting patterns involving larger sets of drugs and ADRs should allow for even more sophisticated use of this valuable source of information. The emphasis on hypothesis generation and refinement applies throughout this thesis: the aim of the record matching algorithm in I is to highlight likely duplicates for manual review and the aim of II, III and IV is to determine the most effective approach to highlighting apparently excessive ADR reporting rates for further follow-up. The purpose of implementing the methods in V for prioritisation of drug–ADR pairs for further follow-up as discussed in Section 4.5 would also be effective hypothesis generation. In knowledge discovery, large numbers of possible associations and patterns are considered simultaneously. Familywise error rates that reflect the probability that any highlighted association corresponds to a false positive are usually less relevant in this context, because all open-ended investigations are bound to produce some false positives. Performance is better evaluated in terms of measures that indicate the proportion of false positives that can be anticipated in a specific study, such as false discovery rates. In our work, we have used two related measures of performance from the literature on Information Retrieval: precision (the number of true positives over the sum of true and false positives) and recall (the number of true positives over the sum of true positives and false negatives). Precision–recall graphs that indicate how the precision and recall vary by the threshold for clinical review are used in both I and III. They provide an informative overview of performance, independent of the selected threshold. 12 Figure 3.1: Exploratory analysis of collections of individual case safety reports 3.2 Process Fayyad et al. (1996) define knowledge discovery as: "The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." The knowledge discovery process is not limited to actual data analysis but includes: data collection, cleaning and preparation; reduction and projection; data analysis and interpretation, and finally dissemination, incorporation into existing structures and action based on discovered knowledge. It thus entails the entire ADR signal detection process outlined in Section 2.3, from the collection of reports and their pooling in an international database, through data preparation and transformation including conversion from free text to structured information, data cleaning and duplicate detection, via disproportionality analysis and triage algorithms to clinical review, and finally communication to national centres, pharmaceutical companies and the general public. The statistical methodology developed in the context of this thesis is applied at two different stages of the knowledge discovery process for ADR surveillance, as indicated in Figure 2.3. On one hand, disproportionality analysis is a core component in screening for excessive ADR reporting rates in first pass analysis of the database. On the other hand, pattern discovery methods are 13 useful in assisting clinical review and highlighting interesting aspects of specific groups of reports in more detailed investigations. Figure 3.1 proposes a general framework for such exploratory analysis. For the purpose of illustration, assume that the data subset of interest consists of all reports involving a particular drug D. At the outset of the exploratory analysis, simple descriptive information such as the total number of reports listing D and from what countries and during what time periods they have been submitted, may be very useful. Together with lists of the most commonly co-reported drugs and ADRs, as well as empirical distributions for patient age and gender, this provides a descriptive overview of the reporting of D which can serve as a useful reference for subsequent discoveries. Experienced data analysts may react directly to descriptive information that contradicts their subject matter knowledge. For example, a domain expert familiar with the WHO database may react to the observation that a suspiciously large proportion of the reports in a subgroup of interest have been submitted from a country with a low overall reporting rate. The middle box in Figure 3.1 is an attempt to formalise such comparative data analysis. Contrasts between the group of reports of interest and a comparison group (e.g. the database as a whole or all reports involving a drug in the same class of drugs as D) provide insight into what properties of the data subset differentiate it from the comparison group. For example, it may turn out that the relative reporting rate of a rare ADR for D by far exceeds that in the database as a whole. Such discrepancies may well be more enlightening than information on what the most commonly reported ADR is in absolute terms. The discussion of such disproportionality analysis is further extended in Section 3.3. Both descriptive and comparative studies may be misleading when the group of interest contains distinct subgroups. For example, if D is prescribed on one hand to young males and on the other hand to elderly females, the summary information that the average patient age on reports listing D is 43 years and that the overall proportion of females is 52%, conveys a very insufficient overview. Clustering algorithms allow for automated partitioning of data, with the aim of detecting latent structure, and may allow for much more relevant subsequent descriptive or comparative data analysis, as indicated by Figure 3.1. In addition to the iteration of automated partitioning, description and comparison described above, there are other methods for pattern discovery in collections of individual case safety reports. Record matching methods such as that adapted for duplicate detection in I can be used to detect groups of unexpectedly similar reports. Modified Hopfield networks and clustering algorithms such as those evaluated in Orre et al. (2005) may allow groups of often recurring ADRs (syndromes) to be identified. Similarly, interaction detection methods such as those in II and IV can be used to highlight suspected ADR risk factors. 14 It is rarely possible to specify at the outset of a large exploratory study, a fully automated, all-purpose approach to exploratory data analysis appropriate for all possible questions and patterns of potential interest. In addition, knowledge discovery often produces results that relate not to the primary study objective, but to fundamental properties of the data or of the data collection process. Thus, data cleaning and analysis are in practice intertwined, so that the correction of a data quality problem highlighted in initial data analysis allows for more refined subsequent data analysis. For example, in screening the WHO database for reporting patterns indicative of suspected drug interaction in IV, some larger groups of non-independent reports were highlighted. Their removal may allow for more accurate subsequent studies of drug–drug interaction in the WHO database. 3.3 Disproportionality The frequency or relative frequency of a certain event (or set of events) in a database is sometimes of direct interest. However, in many knowledge discovery applications the discrepancy between the observed (relative) frequency and its expected value under some baseline model is of greater interest. An example from the analysis of purchasing patterns in supermarket sales data is that even if milk is the product most commonly purchased together with the product of interest, because this is true of most products, it may be more enlightening to point out that, for instance, grapefruit juice is purchased four times as often together with the product of interest as overall in the database. Such contrasts provide the basis of disproportionality analysis, which focuses on identifying events whose relative frequency in a given subgroup deviates substantially from the relative frequency of the same event in a given comparison group. Most modern methods for screening collections of individual case safety reports for excessive ADR reporting rates are based on disproportionality relative to the rest of the database. This is true of the Information Component (IC) (Bate et al. 1998), the Empirical Bayes Geometric Mean (EBGM) (DuMouchel 1999), the Proportional Reporting Ratio (PRR) (Evans et al. 2001) and the Reporting Odds Ratio (ROR) (Egberts et al. 2002). All these measures compare the number of reports on a certain drug–ADR pair to an expected number of reports conditional on the overall reporting rates for the drug and the ADR in the database. The original idea of making comparisons with the database itself as reference goes back to the early days of ADR surveillance (Patwary 1969, Finney 1974). In addition to the lack of reliable external estimates for the international usage of different drugs, an advantage of disproportionality analysis is that marginal reporting biases that affect only the drug or only the ADR, cancel out (at least approximately) in 15 a measure of disproportionality. Thus, even though the reporting rates are likely to be higher for serious than for harmless ADRs, this does not have a considerable impact on the measures of disproportionality, as long as the reporting bias affects all drugs to an equal extent. The main drawback of disproportionality measures is that they rely on comparison to the reporting of other drug–ADR pairs. Thus, if a particular drug–ADR pair is massively reported, it will inflate the overall reporting rates for both the drug and the ADR, sometimes to the extent that excessive reporting rates for the same drug with another ADR or for another drug with the same ADR are masked (Evans 2004, Hauben et al. 2005). Assume the following contingency table based on the cross-classification of reports according to whether they involve a drug x and an ADR y: y not y x a b not x c d The basis for pairwise disproportionality analysis in the WHO database is an observed-to-expected ratio OE contrasting the relative reporting rate of y given x to the overall relative reporting rate of y in the database. With the annotation used in the above contingency table, the observed number of reports on y given x is a, and the expected number of reports conditional on the table marginals is the product of the marginal relative reporting rate of y and the toa+c tal number of reports on x: a+b+c+d · (a + b). The observed-to-expected ratio is: OE = a/(a + b) (a + c)/(a + b + c + d) (3.1) The same measure of disproportionality has been used also in the context of association rule analysis (Agrawal et al. 1996), where it is referred to as the lift or the interest of an association rule involving x and y (Silverstein et al. 1998, Hastie et al. 2001). The similarity between the observed-to-expected ratio and other measures of disproportionality proposed for the analysis of individual case safety reports is clear. The Proportional Reporting Ratio (PRR) based on the above contingency table is (Evans et al. 2001): PRR = a/(a + b) c/(c + d) (3.2) and the corresponding Reporting Odds Ratio (ROR) is (Egberts et al. 2002): ROR = 16 a/b c/d (3.3) The IC measure of disproportionality used in routine knowledge discovery for the WHO database is essentially a conservative version of log2 OE , that tends to 0 for rare drug–ADR pairs. The moderation in magnitude is referred to as shrinkage (for details see Section 3.4 below). The availability of thoroughly evaluated shrinkage measures is the main advantage of the OE ratio over the PRR and the ROR. Other strengths are the link to Bayes classifiers described in Norén (2005) and the somewhat better robustness to zero counts in the contingency table than for the PRR and ROR (van Puijenbroek et al. 2002). The main limitation is that the observed-to-expected ratio provides a less distinct contrast between the group of interest and the reference group by including the group of interest in the reference. Another limitation is that the observed-toexpected ratio for a given pair of events by definition cannot exceed the inverse of the marginal relative reporting rate for each event. For example, if one of the events has an overall relative reporting rate of 0.5, then observed-to-expected ratios involving this event can at most reach 2 (if the relative reporting rate of the first event conditional on the other event is 1.00). In practice this limits the usefulness of observed-to-expected ratio as a measure of disproportionality to events that are reasonably rare. While disproportionality analysis is usually carried out at an early stage of the exploratory analysis of collections of individual case safety reports, there are sometimes requests to compute a measure of disproportionality for a drug– ADR pair highlighted for review based on clinical judgement or on account of one or a few very strong reports. If the drug–ADR pair turns out to be disproportionally reported, this may indeed lend added support. However, observed disproportionality must always be interpreted with caution. The possibility of alternative explanations such as report duplication, violated independence assumptions, publication biases or confounding must always be analysed and clearly stated. 3.4 Shrinkage Shrinkage is an attempt to regularise and reduce the volatility of a measure or parameter estimate of interest, by trading an increase in bias for a decrease in variance. In large and sparse data sets such as national or international collections of individual case safety reports, raw measures of disproportionality tend to sometimes yield very large values based on extremely low numbers of reports, but disproportionality based on just 1 or 2 reports is rarely of practical interest. The problem is that for rare drugs and ADRs, the expected number of reports may be very close to 0, relative to which even a single observed report may constitute a substantial deviation. Very low expected numbers of reports occur in the analysis of collections of individual case safety reports because the 2 by 2 contingency table of Section 3.3 is usually very unbalanced. Even 17 Figure 3.2: The simplified IC shrinkage measure plotted against the standard IC shrinkage measure for 10,000 randomly selected drug–ADR pairs in the WHO database for the most common drugs and ADRs in the WHO database, the number of reports that do not involve either the drug or the ADR, d , is around 3, 000, 000, whereas b and c are generally in the order of 100 or 1, 000 and a is even smaller (a ≤ 10 for 80% of the drug–ADR pairs in the database). In order to reduce the vulnerability to spurious associations, two shrinkage measures of disproportionality have been proposed for the analysis of collections of individual case safety reports: the IC (Bate et al. 1998) and the EBGM (DuMouchel 1999). These measures of disproportionality are versions of the (logarithm of the) observed-to-expected ratio in (3.1) moderated towards a baseline value in the absence of large amounts of data. For the IC, the baseline value is 0 which corresponds to an observed-to-expected ratio of 1. Such shrinkage provides a robust measure of disproportionality moderated towards less extreme values for rare drugs and ADRs. However, as data accumulates it tends to log2 OE as desired. The IC shrinkage measure is defined in II as a Bayesian maximum à posteriori estimate of a parameter related to the logarithm of the observed-to-expected ratio in (3.1). It is well approximated by the following simplified shrinkage measure based on observed and expected counts Oxy and Exy : IC ≈ log2 Oxy + 1/2 Exy + 1/2 (3.4) A comparison between the IC shrinkage measure in II and that in (3.4) for 10,000 randomly selected drug–ADR pairs in the WHO database is presented in Figure 3.2. Clearly, the difference between the two shrinkage measures is negligible. The main advantages of the simplified IC shrinkage measure are 18 that it is easier to compute and that it provides a general recipe for shrinkage that can be applied to any measure expressed in terms of an observed-toexpected ratio, such as the Ω measure of drug–drug interaction in IV. This shrinkage can also be implemented for the PRR and ROR, after re-expression in terms of observed-to-expected ratios with Oxy = a and Exy = (a+b)c c+d for the bc PRR, and with Oxy = a and Exy = d for the ROR. Empirical Bayes estimation provides an alternative framework for shrinkage, where the prior distribution for a group of parameters is estimated based on the empirical distribution of maximum likelihood estimates for the group. The main advantage of empirical Bayes estimators is that they borrow strength from similar observations to improve the overall accuracy. However, with respect to each parameter, its estimate will only improve under the assumption that it is indeed related to the other parameters. Unlike the IC prior distribution, an empirical Bayes prior for the observed-to-expected ratio will not necessarily be centred at 1, and thus may inflate individual disproportionality measures rather than shrink them towards less extreme values. A practical issue is that for drug–ADR pairs that have never been co-reported, the maximum likelihood estimate of the observed-to-expected ratio is 0. In practice, these drug–ADR pairs appear to be ignored in the estimation of the empirical prior distribution in DuMouchel (1999), and the potential bias due to this is unclear. Berry and Berry (2004) propose a hierarchical empirical Bayes estimator for the observed-to-expected measure of disproportionality, where each measure of disproportionality is shrunk towards the group mean for a smaller group of more closely related ADRs. This should allow for more sophisticated empirical Bayes shrinkage, but the identification of appropriate groups of related ADR terms remains a challenging research problem in its own right. 3.5 Pattern discovery and detection Pattern recognition is the attempt to partition a group of data points into classes, based on a given set of explanatory variables (Webb 2002). Distinction is made between supervised and unsupervised pattern recognition: in supervised pattern recognition (or discrimination) a classifier is constructed based on training data consisting of labelled data points with the aim of accurately categorising unseen data points; in unsupervised classification (or clustering), the aim is to identify a natural partitioning of the available data set, without labelled training data available, or even a specification of what the classes of interest may be. For our purposes, the distinction between patterns and models in the context of pattern discovery and detection is more relevant. Hand and Bolton (2004) characterise patterns as related to local features of a data set involving only 19 subsets of the data points and/or subsets of the variables. Whereas a global model provides a high level description of the most important general features of a data set, a pattern may highlight one or a few outlying observations or a strong correlation between two variables. Hand and Bolton (2004) propose the following general definition: "A pattern is a local structure that generates data with an anomalously high density compared with that expected under the (global) baseline model." The focus on deviation from a global baseline model applies broadly to the methods described in this PhD thesis. The very aim of disproportionality analysis, is to identify groups of events that are co-reported more often than would be expected, based on a baseline independence model. Similarly, in duplicate detection and other record matching applications, the aim is to identify pairs (or small subsets) of unexpectedly similar reports whose similarity deviates from a global baseline model assuming all reports have been submitted independently. With the exception of the work on Bayes classifiers in V (which relates primarily to supervised pattern recognition by the above definition), this thesis focuses on unsupervised pattern discovery. The aim is to discover structure in data, without strict à priori specification of what the structure of interest is. At the same time, completely open-ended hypothesis generation is not possible as the type of potential patterns is determined by the choice of pattern discovery method, as well as implicitly by a range of other choices such as the variables considered in a given study (Hand 1994, p 319). Thus, while disproportionality analysis may highlight a variety of patterns related to anything from a suspected drug–ADR association to an elevated reporting rate of a certain drug in one particular country, the type of patterns in such studies is restricted to unexpectedly high (or low) relative reporting rates. Similarly, record matching may highlight a variety of non-independent reports, but all highlighted patterns will refer to unexpected report similarity. 3.6 Facilitating interpretation Interpretation is one of the final steps in the knowledge discovery process (Fayyad et al. 1996), and a key component of the ADR signal detection process. Transparency is of particular importance in the analysis of non-systematically collected data such as individual case safety reports, where the use of overly complex statistical methodology may give a false sense of security and distract domain experts from limitations with the data (Hauben et al. 2005). Breiman (1985) refers to the application of 20 advanced statistical methodology to hide inadequacies with the data as ‘edifice building’; in the ADR signal detection process, the use of overly complex statistical methods may divert clinical experts from careful consideration of alternative explanations to apparently excessive ADR relative reporting rates. Since the primary aim of applying knowledge discovery methods to collections of individual case safety reports is to guide and support domain experts in their manual review, better transparency is a strong argument in favour of choosing a simple method over a more complicated one. Indeed, better transparency is perhaps the strongest argument for choosing the simple IC shrinkage measure over the more complicated one as discussed in Section 3.4. Statistical sophistication does not necessarily rule out transparency, however. The hit-miss model record matching algorithm in I is based on a rather intricate probabilistic model, but its basis for highlighting a given record pair as suspected duplicates is immediately clear from an overview such as that presented in Figure 4.2 of Section 4. While sophisticated statistical methods are sometimes required to make the most of the available data, knowledge discovery results should always be presented as transparently as possible. For example, while shrinkage measures of disproportionality have proved a very powerful basis for filtering individual case safety reports for interesting reporting patterns, they may confuse domain experts, with little interest in the statistical methodology. Moreover, it is difficult to evaluate the impact of data quality issues such as suspected duplication or reporting biases on shrinkage measures of disproportionality. Observed and expected counts provide a more transparent explanation for why certain drug–ADR pairs have been highlighted for manual review. In the presence of suspected data quality issues, simple arithmetic will indicate to what extent an excessive reporting rate may be due to a group of suspected duplicates, for instance. At the same time, domain experts often do want a sense of whether an observed disproportionality is likely to be due to chance or not. Credibility intervals around measures of disproportionality give some such indication, although the potential for violated independence assumptions means that precision can be overestimated. Adjustment for potential confounders may complicate interpretation of shrinkage measures of disproportionality. However, as commented on in III, adjusted observed-to-expected ratios sometimes correspond closely to stratum specific ones, and translating adjusted observed-to-expected ratios to stratum specific ones may simplify interpretation. For example, in the example on hypertension and zimeldine in Appendix E.4 of Hopstadius (2006), the IC increases from -0.33 to +1.65 when adjusted for time of reporting and country of origin. A closer investigation of the detailed data available in Appendix F.2 of the same thesis indicates that the discrepancy 21 is due to hypertension being more than twice as common on US reports (1.7%) as on reports from other countries (0.7%), whereas zimeldine was never used in the USA. Additionally the overall relative reporting rate of hypertension has increased in recent years, whereas zimeldine was primarily used in the early 1980’s. Thus, the crude IC which contrasts the observed relative reporting rate of hypertension given zimeldine to the overall relative reporting rate of hypertension in the entire database underestimates the disproportionality. Arguably, the best information to present to clinical experts in this case would be the observed number of reports on hypertension for zimeldine, and the expected number of such reports based on the relative reporting rate of hypertension in the countries and time period in which it was available. To guide domain experts to appropriate interpretation is clearly as important a challenge as method development in knowledge discovery research. 3.7 Future directions The new methodology proposed in this thesis provides a strong basis for future improvement and further research on knowledge discovery methods for collections of individual case safety reports. The method for drug–drug interaction detection goes beyond simple drug–ADR disproportional reporting rates, and could potentially be used also to screen for other types of ADR risk factors, such as related to patient gender or age. In general, we must aim to make better use of the rich information available on individual case safety reports. Virtually all knowledge discovery methods of today (including those described in this thesis) are based on raw numbers of reports (Hauben et al. 2005). They do not account for the amount or quality of information on each report nor for suspected duplication. This is in stark contrast with clinical review, in which both the quality of single reports and the quality of sets of reports as a group is carefully scrutinised (Meyboom et al. 1997). Indeed, given that the overall aim of applying knowledge discovery methods to collections of individual case safety reports is to assist and direct clinical review, an important challenge for the future is to achieve better alignment between automated knowledge discovery and clinical review. A first step may be to develop new and improved quality criteria for individual case safety reports similar to those discussed by Edwards et al. (1990). Based on such quality criteria, the number of high quality, distinct reports referring to a particular drug–ADR pair can be identified and potentially provide a useful triage criterion. The possibility to highlight single high quality reports is interesting in its own right. The extended hit-miss model record matching algorithm has proved very useful for duplicate detection in the WHO database. Its importance is likely to 22 increase even further in the future, as new categories of health care professionals, and even patients, are invited to submit reports. In addition, the hitmiss model record matching algorithm sometimes highlights non-independent reports other than pure suspected duplicates. Non-independent reports distort data analysis, and their identification is important both for effective first pass screening and for clinical review where the consideration of a group of related reports as independent pieces of information is potentially deceptive. For this purpose, an adapted hit-miss model record matching algorithm should ideally be developed explicitly for the purpose of detecting non-independent reports other than pure duplicates. A main challenge is how to incorporate, in subsequent data analysis, the information that some reports are suspected to be related. One might conceive of an extended disproportionality analysis where reports were weighted according to whether they are part of a suspected cluster or not. Given the tedious process of having suspected duplicates confirmed and removed from collections of individual case safety reports, the same approach could perhaps be used also to account for suspected duplication, in first pass screening. In a similar spirit, reports could perhaps also be weighted by their quality of information. Another important challenge for the future is to further advance the methods for exploring patterns involving large groups of drugs and ADRs in collections of individual case safety reports. In Orre et al. (2005), we use a Hopfield type network and a mixture model based probabilistic clustering algorithm to identify suspected ADR syndromes in the WHO database. The main challenge is that while each syndrome may consist of a large group of ADRs, each report tends to include only a small subset of these, so training data is both noisy and incomplete. Pattern discovery in high-dimensional binary data has been studied in other application areas such as market basket analysis and document retrieval (Bingham et al. 2002), and this research provides a good starting point for further development in our area. An interesting generalisation of the mixture model based clustering algorithm, for high-dimensional binary data, is the subspace clustering method proposed by Patrikainen and Mannila (2004), which models only the most characteristic attributes for each class. For the discovery of reporting patterns based on smaller groups of reports, the hit-miss model based record matching algorithm may potentially prove useful. Its advantage is that it does not attempt to build a global model, but searches for groups of unexpectedly similar reports, based on pairwise comparison. The importance of individual case safety reports for early post-marketing discovery of previously undetected drug toxicity is clear. At the same time, these data sets are not optimal for all types of ADR-related knowledge discovery. Specifically, each report constitutes a snapshot in time, and any information on the patient’s previous medical history is limited, at best. Therefore, it is difficult to evaluate the potential impact of channelling effects, where those 23 patients that do not respond favourably to one medical treatment are systematically switched to a specific other treatment. Similarly, individual case safety reports usually do not provide enough information to determine whether there are differences in the severity of the underlying disease between patients prescribed different drugs. Yet another limitation with individual case safety reports is that adverse events without clear temporal association with the prescription of the drug are difficult to identify as suspected ADRs, in particular if the background incidence of the adverse event is high (Meyboom et al. 1997). As a consequence, longitudinal patient records listing patients’ entire medical histories are a very interesting complementary source of information. Combined, individual case safety reports and longitudinal patient records may allow for more comprehensive ADR related knowledge discovery. While the methodology proposed in the context of this thesis has been developed specifically for the exploratory analysis of collections of individual case safety reports, some of it may be relevant also for the analysis of longitudinal patient records. Specifically, the method for interaction detection introduced in IV can be adapted to longitudinal patient records, and the proposed framework for exploratory analysis outlined in Figure 3.1, should, with some modifications, apply also to longitudinal patient records. 24 4. Overview of the papers This thesis is based on five original contributions. The order in which they are presented corresponds roughly to their natural order of application in the knowledge discovery process for ADR surveillance. I focuses on improving data quality through identifying suspected duplicate reports. II, III and IV propose improvements to, and evaluate different aspects of, disproportionality analysis for individual case safety reports. Finally, V proposes a bootstrap method to estimate the uncertainty in each prediction of a Bayes classifier. Historically, II and V are based on related work on Bayesian bootstrap analysis in 2003. An earlier version of II was presented at the 25th annual conference of the International Society for Clinical Biostatistics in Leiden, the Netherlands, 2004. The duplicate detection algorithm in I was developed during 2004 and 2005, and a shorter version of this paper was presented at the Eleventh International Conference on Knowledge Discovery and Data Mining in Chicago, 2005. The evaluation of the adjusted observed-to-expected ratio in III was performed during 2005 and 2006, and the statistical methodology for drug–drug interaction detection in IV was developed during 2006. The aim of this section is to provide a conceptual overview of the five papers. 4.1 Paper I Good data quality is a prerequisite for effective data analysis (Kim et al. 2003, De Veaux and Hand 2005). One important data quality problem in collections of individual case safety reports is that of report duplication. Duplicate reports are unlinked reports related to the same ADR incident, perhaps provided by different health professionals or by the same health professional to different drug safety centres. Their presence is a problem in the analysis of individual case safety reports because the total number of reports on a particular drug–ADR pair is both the basis for automated knowledge discovery and an important piece of information in clinical review of potential drug safety signals. When a single suspected ADR incident yields several reports, this may divert the analysis. Some studies indicate that duplicates may account for as large a proportion as 5% of all reports. More importantly, suspected report duplication appears not to be evenly spread in the data set, but whereas most reports have no suspected duplicates, a small minority have several. Require- 25 True value a T b X Y Observed value on first report Observed value on second report 1-a-b Miss ? Blank − Hit T Figure 4.1: The hit-miss model ments and regulations selectively stimulate reporting of previously unknown and serious ADRs, and may also increase the risk of duplicate reports related to such incidents (R. H. B. Meyboom, personal communication). The identification of suspected duplicates is thus an important step towards improved data quality and, ultimately, more effective automated knowledge discovery as well as better informed clinical review. The identification of suspected duplicates in collections of individual case safety reports is a difficult challenge. Duplicate reports will often either have been submitted by different individuals or processed in different reporting systems, and as such can be superficially very dissimilar. Different ADR terms may have been used to encode the same incident, patient information may be erroneous or incomplete and the listed drugs may differ between reports related to the same incident. Therefore, simple rule based methods are usually insufficient to reliably detect suspected duplicates. The duplicate detection method proposed in I is based on the hit-miss model for statistical record linkage introduced by Copas and Hilton (1990). The hit-miss model provides a probability model for how discrepancies between related database records occur. It allows for flexible and robust record matching in the presence of a large variety of errors. Under the hit-miss model, each observed value X on a database record (for example a listed patient gender on a report) is based on a true but unobserved value T = t (in this case the true gender of the patient). Observed values on related records are assumed to have been generated in independent identically distributed random processes resulting in i) a miss (with respect to the true value) with probability a, ii) a blank with probability b, or iii) a hit with probability 1 − a − b (see Figure 4.1). For a miss X is a random value independent of T but following the same distribution, for a blank the value of X is missing and for a hit X = t . Hits and misses are unobservable events of an assumed data generating process. In screening for suspected 26 Tachycardia ventricular 2002-02-07 ? 62 years Norway Sertraline Mirtazapine 2002-02-07 Female 60 years Norway Sertraline Mirtazapine Zopiclone Tachycardia ventricular = ? ≠ = = = ≠ = +12.0 ±0 -0.2 +7.2 +6.1 +8.7 -2.3 +8.1 Compensation for correlation between sertraline, mirtazapine and tachycardia -1.4 +38.2 Figure 4.2: Hit-miss model based scoring of a sample record pair duplicates we make comparisons between distinct reports, based on whether they have matching or mismatching information. Duplicate detection in the WHO database is based on patient gender, patient age, outcome, country of origin, date of onset, as well as all listed drugs and ADRs. For each record field, a match weight is calculated based on the likelihood ratio for the observed matching event under the assumption that the two records under study i) relate to the same underlying ADR incident or ii) are unrelated. The total match score is obtained by adding together the match weights for all record fields, as illustrated in Figure 4.2. It can be shown that, under the hit-miss model, matches always receive positive weights, mismatches receive negative weights and missing information on either report results in a match weight of 0. Moreover, matches on rare events receive higher match weights than matches on common events. This is an appealing property since chance matches between unrelated record fields are more likely on common events. The penalty for a mismatch is constant for a given record field but varies between record fields depending on how many mismatches were observed in each record field in the available training data (consisting of confirmed pairs of duplicate reports). Thus, in screening for suspected duplicates, mismatches in error prone record fields are penalised less than mismatches in record fields that are usually reliable. In I, we propose two methodological improvements to the standard hit-miss model: a hit-miss mixture model for numerical record fields and an adjustment of the overall match score for violated independence assumptions between matching record fields. The hit-miss mixture model extends the hit-miss model by including the possibility of imperfect matches in numerical record fields, which are less detached from the true value than complete misses. Deviations follow a narrow distribution centred at the true value. The compensation for violated independence assumptions is based on an IC dispropor27 tionality measure for the overall co-occurrence of two matching events in the database. It reduces the total match score for groups of matched events that occur together more often in the database than would be expected under the assumption of independence. The greatest strengths of the extended hit-miss model are that it provides transparent and intuitive match weights and that its parametrisation allows for robust fitting also in the absence of large numbers of confirmed duplicates. Because suspected duplicates can be reliably confirmed or refuted, the performance of a proposed duplicate detection method can be easily evaluated. In I, we demonstrate that the extended hit-miss model is able to identify with high accuracy (94.7% in our test data set), the most likely duplicate for a given database record. We also show that it effectively discriminates pairs of true duplicates from random matches. In a batch of 1559 Norwegian reports that included 19 confirmed duplicates, the extended hit-miss model identified 12 of the 19 already known duplicates (corresponding to a 63% recall) while additionally highlighting two pairs and one set of three reports as suspected duplicates that were not originally labelled as such (corresponding to a nominal 71% precision). Out of the additional suspected duplicates, one pair was later confirmed by the Norwegian national centre as a set of true duplicates, the other pair remains a set of suspected but unconfirmed duplicates and the set of three suspected duplicates turned out to be separate reports on the same drug–ADR pair submitted by the same dentist, but for three distinct patients. 4.2 Paper II The IC measure of disproportionality discussed in Sections 3.4 and 3.3 is the basis for routine screening of the WHO database to highlight excessive ADR reporting rates. In its original implementation (Bate et al. 1998), the IC only allowed for the identification of pairwise disproportionality (typically between one drug and one ADR). It relied on large sample approximations to compute credibility intervals and did not accommodate adjustment for suspected confounders. In response to these issues, II proposes credibility intervals accurate also for small samples, adopts a post-stratification approach to adjust for suspected confounders and introduces a simple extension to higher orders for the IC measure of disproportionality. The overall aim of these improvements is to allow more sophisticated and reliable screening for disproportional reporting rates in the WHO database. The credibility intervals for the IC proposed in Bate et al. (1998) were based on a normal approximation to the posterior IC distribution. In II, we demonstrate by precise Monte Carlo simulation that this is often not accurate enough. As an alternative, we propose an approximate formula for computing credibil28 ity intervals of the posterior IC distribution that is accurate also for rare events. It may seem counter-intuitive that the use of small sample methods should be necessary in the analysis of a data set with nearly 4 million records. However, as the focus turns to specific drug–ADR pairs, the number of relevant reports decreases very rapidly. Among the around 720,000 drug–ADR pairs ever coreported in the WHO database, more than 320,000 are co-reported only once, and an additional 106,000 only twice. More than 80% are co-reported less than 10 times. In the context of the variety of challenges involved in analysing these data sets, the importance of very accurate credibility intervals is perhaps limited, but one practically useful aspect of the refined credibility intervals proposed in II over those in Bate et al. (1998) is that they allow examples where the first three reports on a new drug all refer to the same ADR to be highlighted. This may allow for very early warning of some suspected ADRs. There may be a need to eliminate the impact of suspected confounders in disproportionality analysis. In II, we adopt a post-stratification approach to adjusting the observed-to-expected ratio for potential confounders originally proposed by DuMouchel (1999). The adjusted observed-to-expected ratio is an average of stratum specific observed-to-expected ratios weighted by the stratum specific expected numbers of reports: OE ad j = ∑z Ozxy z Exy z · Exy z ∑z Exy Oxy = z ∑z Exy (4.1) The relative merits of the adjusted observed-to-expected ratio, and the general impact of confounding on disproportionality analysis in the WHO database are further discussed in III. The extension of the IC to higher order associations in II is an important step towards being able to screen for disproportional reporting indicative of effect modification (for example variation across age groups in the risk of a certain ADR due to a particular drug). The higher order IC is simple to estimate and robust to overfitting based on limited amounts of data. Disregarding shrinkage, the third order IC between events x, y and z proposed in II is: ICxyz = ICxy|z − ICxy where: ICxy|z = log2 P(y | x, z) P(y | z) (4.2) (4.3) Thus, a positive third order IC value indicates that the presence of event z increases the disproportionality between x and y (and vice versa – the measure is symmetrical in x, y and z). We further show that the third order IC can be expressed as an observed-to-expected ratio for the threeway relative reporting 29 rate, where the expected relative reporting rate is based on a product of factors relating to main effects and pairwise interaction. The advantage of (4.2) relative to the third order IC proposed in Orre et al. (2000) is that (4.2) accounts for pairwise associations in the expected relative reporting rate. The discussion of how to best screen for interaction in ADR surveillance is further extended in IV. 4.3 Paper III Confounders are covariates that distort the quantitative relationship under study. A textbook example of confounding is that the crude association between coffee drinking and coronary heart disease in observational studies may be due to heavy coffee drinkers also having a greater propensity to smoke (Hennekens et al. 1976). While in experimental studies, randomisation in principle eliminates the impact of all potential confounders, observational studies are non-randomised by design and require each suspected confounder to be individually identified and adjusted for in the analysis (or accounted for in the study design). The potential impact of unaccounted for confounding variables is a constant concern in the interpretation of observational data. It has been argued that routine adjustment for potential confounders is crucial also in first pass screening for excessive ADR reporting rates in collections of individual case safety reports to avoid highlighting disproportional reporting rates driven by other covariates. In III, we study the relative merits of the post-stratification approach to routine adjustment of the observed-to-expected ratio adopted for the IC in II. We focus on the WHO database, but use both simulated stratification and stratification based on true covariates (in particular patient age, patient gender, country of origin and time of reporting). The two main results are that the adjusted observed-to-expected ratio is sensitive to over-stratification and that routine adjustment for common potential confounders has less impact on signal detection performance than initially believed. With a careful selection of suspected confounders and a more coarse categorisation of these covariates, routine adjustment does improve performance relative to a literature comparison, in our investigation. However, this performance improvement is modest compared to that due to imposing a triage criterion that requires reports from more than one country to highlight a drug–ADR pair for clinical review (see Figure 4.3). These results support the claim by Bate et al. (2003), that confounding may be less important a bias in first pass screening of collections of individual case safety reports for excessive ADR reporting rates than generally assumed. 30 0.7 Crude without triage Adjusted without triage Crude with triage Adjusted with triage 0.6 Precision 0.5 0.4 0.3 0.2 0.1 0 0.2 0.4 0.6 0.8 1 Recall Figure 4.3: Precision–recall graphs relative to a literature reference for the crude IC025 and the IC025 simultaneously adjusted for country of origin and reporting time interval, with and without a triage criterion requiring reports from more than 1 country. The graphs plot precision (number of true positives over number of true positives and false positives) vs recall (number of true positives over number of true positives and false negatives) at varying thresholds on IC025 . 4.4 Paper IV Interaction between drug substances may lead to excessive risk of certain ADRs when two drugs are taken at the same time. If previously unknown high risk drug combinations can be identified, they can potentially be avoided in the future, and drugs that would have otherwise been withdrawn can remain on the market with warnings concerning co-medication. Thus, the identification of suspected drug-drug interaction is important both from the individual patient safety perspective and from the general public health perspective. In addition to the higher order IC measure of disproportionality in II, two regression based approaches to screening individual case safety reports for suspected drug–drug interaction have also been proposed (van Puijenbroek et al. 1999, DuMouchel and Pregibon 2001), but no publicly available results indicate that any of the proposed methods have been successfully applied to prospective screening for suspected drug–drug interaction. An important contribution of IV is the observation that the limited success of earlier proposed methods for drug–drug interaction detection may be due to their use of baseline models where, in the absence of interaction, different risk factors essentially multiply. There are arguments from both public health and individual patient safety perspectives to consider absolute differences in risk rather than relative ones (Rothman et al. 1980): from a public health perspec31 tive, we are interested in whether the absolute number of ADR incidents of a certain type in a given population depends on to what extent two different drugs are co-prescribed; from the individual decision-making point of view, we want to know whether the increase in absolute risk of a certain ADR due to the prescription of one drug is modified by the co-prescription of another drug. Based on these arguments, we propose in IV the Ω measure of suspected drug–drug interaction. Ω is a shrinkage measure of threeway disproportional reporting, based on the logarithm of an observed-to-expected ratio for the relative reporting rate of an ADR A under co-prescription of drugs D1 and D2 . In our model, the background risk of A and the risks of A attributable to D1 and D2 , respectively, are independent. For small attributable risks, this leads to an approximately additive model for risk difference, in the population. The main technical contribution is an approach to estimate the expected relative reporting rate of A given D1 and D2 co-prescribed. In studies of the WHO database, we show that Ω highlights examples of established drug–drug interaction, with excessive relative reporting rates that go undetected with logistic regression. For example, unlike logistic regression Ω indicates that there is suspected interaction between gemfibrozil and cerivastatin with respect to the risk of rhabdomyolysis. This is a well established drug–drug interaction and co-prescription together with gemfibrozil was contraindicated for cerivastatin even as it was introduced to the general public. There are over 1,000 reports in the WHO database on rhabdomyolysis for concomitant use of cerivastatin and gemfibrozil, and the relative reporting rate of rhabdomyolysis given cerivastatin together with gemfibrozil is over 75%. This is to be compared with relative reporting rates of 0.1% in the absence of both cerivastatin and gemfibrozil, 4% for gemfibrozil in the absence of cerivastatin and 25% for cerivastatin in the absence of gemfibrozil. Clearly, a method for drug–drug interaction must highlight this as indicative of suspected drug–drug interaction in order to be practically useful for first pass screening purposes. Ω fulfils this requirement and allows for computationally efficient first pass screening for suspected drug–drug interaction in collections of individual case safety reports. 4.5 Paper V The aim of V is to demonstrate the usefulness of case-based imprecision estimates for Bayes classifier predictions. Unlike the overall expected prediction error, case-based precision estimates indicate the certainty with which each individual data point is predicted. Clearly, this will vary depending on the degree of similarity between the data point of interest and those in training data. Bayes classifiers are generative classifiers that predict class membership indirectly, based on estimated distributions of the explanatory variables given 32 a specific class. This is in contrast with discriminative classifiers, such as logistic regression whose parameters are optimised directly with respect to the prediction performance on a given set of training data. As noted by Ng and Jordan (2002), generative classifiers may reach their (higher) asymptotic error more rapidly than discriminative classifiers such as logistic regression, and thus be preferable in the absence of large amounts of training data. Despite its often violated assumption of mutual independence between explanatory variables given class membership, the naive Bayes classifier has proved to compare well with more sophisticated classification methods in many real world applications (Domingos and Pazzani 1997, Hand and Yu 2001). However, the exact values of the estimated class probabilities are not trustworthy as there is a tendency of the naive Bayes classifier of being too confident in its predictions (Hand and Yu 2001). As an alternative, we propose that the certainty with which each data point is classified be estimated based on Bayesian bootstrap resampling of the original training data. The Bayesian bootstrap produces a large number of slightly modified training data sets by repeatedly assigning Di(1, 1, . . . , 1) distributed random weights to the observations in the original training data. Based on each Bayesian bootstrap replicate of the original training data, a Bayes classifier is trained and used to predict the data point(s) of interest. Instead of the predicted probability of class membership based on the original Bayes classifier, we propose that the proportion of Bayesian bootstrap replicates for which the predicted probability of class membership exceeds 0.5 be used as an estimate of the certainty with which a given data point is predicted. We provide results in V that indicate that this reduces the expected loss, when some misclassifications are more costly than others. A comment made in Norén (2005), which is worth repeating, is that in V, the marginal class probabilities P(y j ) were estimated essentially as the proportion of instances from each class in the available training data. This is appropriate when training data is a representative sample from the population to which the classifier is to be applied. If, on the other hand, the composition of training data does not necessarily represent that of future observations, then P(y j ) must either be estimated based on external data relevant to the population of interest or be based on prior assumptions. The Bayesian bootstrap can be modified to accommodate this, by replacing the numbers of data points from each class in training data {ny1 , ny2 , . . .} (see Table 1 in V) by the corresponding numbers in a data set representative for future samples (or by appropriate pseudo-counts). V is the only paper in this thesis not to have derived methods explicitly for the purpose of improved ADR surveillance. However, the methods for improved Bayes classification under asymmetrical loss have a potential application in the development of more data driven triage algorithms for ADR surveillance. As implemented today, the triage algorithms are based exclusively on clinical expertise (Ståhl et al. 2004). A Bayes classifier framework may allow for a more data driven approach where clinical judgement of the value of previ33 ously highlighted drug–ADR pairs is used as training data for the construction of a Bayes classifier. Useful explanatory variables for such an implementation might include the total number of reports listing a given drug–ADR combination, as well as their quality, diversity and geographical spread; the number of positive de- or rechallenge interventions etc. Given that in ADR signal detection, missed problems are more problematic than falsely highlighted ones, the loss functions involved will be asymmetrical, and the Bayesian bootstrap method proposed in V should allow for improved performance. 34 Acknowledgements I would like to express my gratitude to all those who have provided support and encouragement during the work that has lead to this PhD thesis. • Professor Rolf Sundberg for inspiration and advice, and for providing an excellent example of how to combine a profound knowledge of mathematical statistics with a genuine interest in solving real world problems. • Professor Ralph Edwards for helping me to broaden my views in pharmacovigilance, for identifying a wide variety of challenging research problems and for providing an ambitious overall vision for the work of the Uppsala Monitoring Centre. • Andrew Bate for encouraging me to enrol as a PhD student, for day to day support and advice and for a very rewarding and productive collaboration. • Marie Lindquist, Ron Meyboom and Sten Olsson for sharing their knowledge of ADR signal detection and of the WHO programme. • All colleagues at the Uppsala Monitoring Centre for providing an excellent work environment, in particular the members of the R&D team, past and present: Erik Swahn, Jonathan Edwards, Malin Ståhl, Sven Purbe, Johan Hopstadius, Kristina Star, Johanna Strandell and Ola Caster. • Roland Orre for expert computational support and advice. • All members of the Division of Mathematical Statistics at Stockholm University for providing a stimulating research environment and for welcoming me to the group. Finally, I would like to thank all my family and friends from Järbo, New Bethlehem, Göteborg, Uppsala and elsewhere. A special thank you to my parents Hasse and Christina Norén for your everlasting support and encouragement, and to Minna, my love, for making it all worthwhile. Uppsala, March 2007, Niklas Norén 35 Bibliography Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A. I.: 1996, Fast discovery of association rules, Advances in knowledge discovery and data mining, American Association for Artificial Intelligence, pp. 307–328. Aronson, J. K. and Hauben, M.: 2006, Anecdotes that provide definitive evidence, British Medical Journal 333(7581), 1267–1269. Bate, A.: 2003, The Use of a Bayesian Confidence Propagation Neural Network in Pharmacovigilance, PhD thesis, Umeå University. Bate, A., Edwards, I. R., Lindquist, M. and Orre, R.: 2003, Violation of homogeneity: the author’s reply, Drug Safety 26, 364–366. Bate, A., Lindquist, M., Edwards, I. R., Olsson, S., Orre, R., Lansner, A. and De Freitas, R. M.: 1998, A Bayesian neural network method for adverse drug reaction signal generation, European Journal of Clinical Pharmacology 54, 315– 321. Berry, S. M. and Berry, D. A.: 2004, Accounting for multiplicities in assessing drug safety: a three-level hierarchical mixture model, Biometrics 60(2), 418–426. Bingham, E., Mannila, H. and Seppänen, J. K.: 2002, Topics in 0–1 data, KDD’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM Press, New York, NY, USA, pp. 450–455. Breiman, L.: 1985, Nail finders, edifices and Oz, Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Vol. I, Wadsworth, Belmont, CA, USA, pp. 201–214. Breiman, L.: 2001, Statistical modeling: the two cultures (with comments and a rejoinder by the author), Statistical science 16(3), 199–231. Copas, J. and Hilton, F.: 1990, Record linkage: statistical models for matching computer records, Journal of the Royal Statistical Society: Series A 153(3), 287–320. Coulter, D. M., Bate, A., Meyboom, R. H., Lindquist, M. and Edwards, I. R.: 2001, Antipsychotic drugs and heart muscle disorder in international pharmacovigilance: data mining study, British Medical Journal 322(7296), 1207–1209. 37 De Veaux, R. D. and Hand, D. J.: 2005, How to lie with bad data, Statistical Science 20(3), 231–238. Domingos, P. and Pazzani, M.: 1997, On the optimatility of the simple Bayesian classifier under zero-one loss, Machine learning 29, 103–130. DuMouchel, W.: 1999, Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system, American Statistician 53, 177–202. DuMouchel, W. and Pregibon, D.: 2001, Empirical Bayes screening for multi-item associations, KDD ’01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 67–76. Edwards, I. R.: 1997, Adverse drug reactions: finding the needle in the haystack, British Medical Journal 315(7107), 500. Edwards, I. R.: 1999, Spontaneous reporting – of what? Clinical concerns about drugs, British Journal of Clinical Pharmacology 48(2), 138–141. Edwards, I. R. and Aronson, J. K.: 2000, Adverse drug reactions: definitions, diagnosis and management, Lancet 356(9237), 1255–1259. Edwards, I. R. and Biriell, C.: 1994, Harmonisation in pharmacovigilance, Drug Safety 10(2), 93–102. Edwards, I. R., Wiholm, B.-E., Lindquist, M. and Napke, E.: 1990, Quality criteria for early signals of possible adverse drug reactions, Lancet 336(8708), 156–158. Efron, B.: 2001, [Statistical modeling: the two cultures]: Comment, Statistical science 16(3), 218–219. Egberts, A. C., Meyboom, R. H. and van Puijenbroek, E. P.: 2002, Use of measures of disproportionality in pharmacovigilance: three Dutch examples, Drug Safety 25(6), 453–458. Elder, J. F. and Pregibon, D.: 1996, A statistical perspective on knowledge discovery in databases, Advances in knowledge discovery and data mining, American Association for Artificial Intelligence, Menlo Park, CA, USA, pp. 83–113. Evans, S. J. W.: 2000, Pharmacovigilance: a science or fielding emergencies?, Statistics in Medicine 19(23), 3199–3209. Evans, S. J. W.: 2004, Statistics: analysis and presentation of safety data, in J. Talbott and P. Waller (eds), Stephens’ detection of new adverse drug reactions, John Wiley & Sons, Chichester, England, pp. 301–328. Evans, S. J. W., Waller, P. C. and Davis, S.: 2001, Use of proportional reporting ratios (PRRs) for signal generation from spontaneous adverse drug reaction reports, Pharmacoepidemiology and Drug Safety 10(6), 483–486. 38 Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P.: 1996, The KDD process for extracting useful knowledge from volumes of data, Communications of the ACM 39(11), 27–34. Finney, D. J.: 1966, Monitoring adverse reactions to drugs – its logic and its weaknesses, Proceedings of the European Society for the study of Drug Toxicity 7, 198–207. Finney, D. J.: 1971, Statistical logic in the monitoring of reactions to therapeutic drugs, Methods of Information in Medicine 10(4), 237–245. Finney, D. J.: 1973, The detection of causation of adverse events, Proceedings of the 39th session of the International Statistical Institute, pp. 387–393. Finney, D. J.: 1974, Systematic signalling of adverse reactions to drugs, Methods of Information in Medicine 13(1), 1–10. Glymour, C., Madigan, D., Pregibon, D. and Smyth, P.: 1997, Statistical themes and lessons for data mining, Data Min. Knowl. Discov. 1(1), 11–28. Hand, D. J.: 1994, Deconstructing statistical questions, Journal of the Royal Statistical Society. Series A (Statistics in Society) 157(3), 317–356. Hand, D. J.: 1998, Data mining: Statistics and more?, The American Statistician 52, 112–118. Hand, D. J. and Bolton, R.: 2004, Pattern discovery and detection: A unified statistical methodology, Journal of Applied Statistics 31(8), 885–924. Hand, D. J. and Yu, K.: 2001, Idiot’s Bayes—not so stupid after all?, International Statistical Review 69(3), 385–398. Hastie, T., Tibshirani, R. and Friedman, J.: 2001, The elements of statistical learning: data mining, inference and prediction, Springer-Verlag, New York, NY, USA. Hauben, M., Madigan, D., Gerrits, C. M., Walsh, L. and van Puijenbroek, E. P.: 2005, The role of data mining in pharmacovigilance, Expert Opinion on Drug Safety 4(5), 929–948. Hennekens, C., Drolette, M., Jesse, M., Davies, J. and Hutchison, G.: 1976, Coffee drinking and death due to coronary heart disease, New England Journal of Medicine 294(12), 633–636. Hopstadius, J.: 2006, Methods to control for confounding variables in screening for associations in the WHO drug safety database, Master’s thesis, Uppsala University. 39 Kim, W. Y., Choi, B.-J., Hong, E. K., Kim, S.-K. and Lee, D.: 2003, A taxonomy of dirty data., Data Mining and Knowledge Discovery 7(1), 81–99. Lindquist, M.: 2003, Seeing and Observing in International Pharmacovigilance – Achievements and Prospects in Worldwide Drug Safety, PhD thesis, Katholieke Universiteit Nijmegen. Mannila, H.: 1996, Data mining: machine learning, statistics, and databases, Pro- ceedings of the 8th International Conference on Scientific and Statistical Database Management (SSDBM ’96), pp. 2–9. Meyboom, R. H. B., Egberts, A. C. G., Edwards, I. R., Hekster, Y. A., de Koning, F. H. P. and Gribnau, F. W. J.: 1997, Principles of signal detection in pharmacovigilance, Drug Safety 16(6), 355–365. Meyboom, R. H. B., Lindquist, M., Egberts, A. C. G. and Edwards, I. R.: 2002, Signal selection and follow-up in pharmacovigilance, Drug Safety 25(6), 459–465. Ng, A. Y. and Jordan, M. I.: 2002, On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes, in T. G. Dietterich, S. Becker and Z. Ghahramani (eds), Advances in Neural Information Processing Systems 14, MIT Press, Cambridge, MA. Norén, G. N.: 2005, Statistical methods for large scale exploratory analysis of postmarketing drug safety data. Licentiate thesis, Stockholm University. Olsson, S.: 1998, The role of the WHO programme on international drug monitoring in coordinating worldwide drug safety efforts, Drug Safety 19(1), 1–10. Orre, R., Bate, A., Norén, G. N., Swahn, E., Arnborg, S. and Edwards, I. R.: 2005, A Bayesian recurrent neural network for unsupervised pattern recognition in large incomplete data sets, International Journal of Neural Systems 15(3), 207– 222. Orre, R., Lansner, A., Bate, A. and Lindquist, M.: 2000, Bayesian neural networks with confidence estimations applied to data mining, Computational Statistics & Data Analysis 34, 473–493. Patrikainen, A. and Mannila, H.: 2004, Subspace clustering of high dimensional binary data – a probabilistic approach, Proc. Fourth SIAM Int’l Conf. Data Mining, Workshop Clustering High Dimensional Data and Its Applications, pp. 57–65. Patwary, K. M.: 1969, Report on statistical aspects of the pilot research project for international drug monitoring, Technical report, Report prepared for the World Health Organization, Geneva. Purcell, P. M.: 2003, Data mining in pharmacovigilance, International Journal of Pharmaceutical Medicine 17(2), 63–64. 40 Rawlins, M. D.: 1988, Spontaneous reporting of adverse drug reactions. II: Uses, British Journal of Clinical Pharmacology 1(26), 7–11. Rothman, K. J., Greenland, S. and Walker, A. M.: 1980, Concepts of interaction, American Journal of Epidemiology 112(4), 467–470. Sanz, E. J., De-las-Cuevas, C., Kiuru, A., Bate, A. and Edwards, I. R.: 2005, Selective serotonin reuptake inhibitors in pregnant women and neonatal withdrawal syndrome: a database analysis, The Lancet 365, 482–487. Savage, R. L.: 1985, Adverse drug reaction monitoring, Master’s thesis, University of Newcastle upon Tyne. Silverstein, C., Brin, S. and Motwani, R.: 1998, Beyond market baskets: generalizing association rules to dependence rules, Data mining and Knowledge Discovery 2, 39–68. Ståhl, M., Lindquist, M., Edwards, I. R. and Brown, E. G.: 2004, Introducing triage logic as a new strategy for the detection of signals in the WHO drug monitoring database, Drug Safety 13(6), 355–363. van Puijenbroek, E. P., Bate, A., Leufkens, H. G. M., Lindquist, M., Orre, R. and Egberts, A. C. G.: 2002, A comparison of measures of disproportionality for signal detection in spontaneous reporting systems for adverse drug reactions, Pharmacoepidemiology and Drug Safety 11(1), 3–10. van Puijenbroek, E. P., Egberts, A. C., Meyboom, R. H. B. and Leufkens, H. G. M.: 1999, Signalling possible drug-drug interactions in a spontaneous reporting system: delay of withdrawal bleeding during concomitant use of oral contraceptives and itraconazole, British Journal of Clinical Pharmacology 47, 689–693. Webb, A.: 2002, Statistical pattern recognition, 2 edn, John Wiley & Sons, Chichester, England. 41