The value of computer-administered self-report data in central nervous system clinical trials

Bill Byrom1* & James C Mundt2

Addresses
1 ClinPhone Group Ltd, Lady Bay House, Meadow Grove, Nottingham NG2 3HF, UK
Email: [email protected]
2 Healthcare Technology Systems Inc, 7617 Mineral Point Road, Suite 300, Madison WI 53717, USA
*To whom correspondence should be addressed

Current Opinion in Drug Discovery & Development 2005 8(3):374-383
© The Thomson Corporation ISSN 1367-6733

Over the past decade, significant research has been conducted into the development of computer-delivered interviews and scoring methods to assess disease severity in a number of central nervous system indications. In particular, equivalence has been demonstrated between gold standard assessments (such as the Hamilton depression and Hamilton anxiety rating scales) delivered either by computer or by clinicians, in a number of clinical studies. Such computer-based approaches have a number of advantages, including the elimination of inter-rater variance and prevention of inclusion bias when inclusion criteria are based upon baseline ratings. In addition, some studies have demonstrated that computer-administered ratings can be more sensitive to the detection of treatment-related changes than clinician ratings. Much of this evidence has led to recent US Food and Drug Administration acceptance of certain computer-administered assessments as primary endpoints in depression clinical trials.

Keywords Anxiety clinical trials, baseline score inflation, computerized clinical assessments, depression clinical trials, interactive voice response, placebo effect

Abbreviations
CNS: Central nervous system
HAMA: Hamilton anxiety rating scale
HAMD: Hamilton depression rating scale
IDS: Inventory of depressive symptoms
IVR: Interactive voice response
LSAS: Leibowitz social anxiety scale
MADRS: Montgomery-Åsberg depression rating scale
PDA: Personal digital assistant
QIDS: Quick inventory of depressive symptoms
Y-BOCS: Yale-Brown obsessive-compulsive scale

Introduction
This article reviews the development and validation of computerized clinical assessments for use in clinical trials of drugs affecting the central nervous system (CNS), with emphasis on depression and anxiety. Key validation studies are discussed, demonstrating the equivalence of computerized assessments to clinician-made subjective assessments, with particular focus on the Hamilton depression (HAMD) and Hamilton anxiety (HAMA) rating interviews. The results of studies performed over the last five years using these techniques are reported, highlighting their particular utility in limiting the potential for inclusion bias in CNS clinical trials and providing particularly sensitive measures of change. Commentary is also given on the regulatory acceptance of this methodology and the direction of future research in this area.

Acceptability of self-report data
In recent years, patient-reported outcomes have become increasingly used and accepted as important endpoints in clinical drug submissions, and a number of successful drug approvals have been achieved using patient self-report data. For example, sertraline for premenstrual dysphoric disorder, paroxetine for social anxiety disorder, topiramate for migraine/headache and zaleplon for sleep disorders were all approved using data reported and collected by patients. There are a multitude of such examples, but more recently there has been an increasing trend toward collecting such data using electronic means rather than paper diaries and questionnaires. The US Food and Drug Administration (FDA) approved eszopiclone for the treatment of insomnia based upon primary sleep-diary data collected using a telephone-based patient diary (via interactive voice response (IVR)). In addition, the approved FDA submission for ketorolac tromethamine as a treatment for ocular pain included patient self-report data collected using a personal digital assistant (PDA), and the prescribing information for the estradiol/levonorgestrel transdermal system Climara Pro, a hormonal transdermal patch for the treatment of postmenopausal symptoms, included patient-reported bleeding/spotting data collected using IVR during a one-year clinical trial.

The European Medicines Agency (EMEA) makes direct reference to the use of electronic means for the collection of patient self-report data in two of its clinical investigation guidance documents, for asthma [1] and oral contraceptives [2]. More significantly, in August 2004 the FDA stated that it would accept self-report data in general as an approach to assessing symptom severity in clinical drug trials focusing on major depressive disorder in outpatients [3••]. The agency identified as acceptable approaches the HAMD performed using a validated computer interview delivered by IVR, and the inventory of depressive symptoms (IDS) [4] and quick IDS (QIDS), in their various forms. This is particularly significant, as primary endpoints in depression trials have previously been based exclusively upon subjective ratings performed by the clinician. In the remainder of this review, the use of computer-administered self-report data in CNS clinical trials, and evidence of the validity of measuring clinical endpoints in this way, are explored.

Rationale for the electronic application of self-report instruments
There is a growing body of evidence in the literature illustrating the limitations of using paper diaries and questionnaires in recording patient-reported outcomes.
Paper instruments are subject to missing data, conflicting entries, ambiguous completion and invented values [5,6]. One methodology study even detected patients completing their paper diary entries prior to the actual days for which the entries were 'reported' [7•]. Electronic solutions, whether delivered using a personal computer, PDA or IVR, offer a number of advantages over paper approaches, including in-built logic checking, automated questionnaire branching, time- and date-stamped entries, and defined entry windows to prohibit retrospective or prospective entries. These features mean that patient-reported data collected electronically have greater quality and integrity than paper-recorded data. However, many of the instruments discussed in this review are not simply patient diaries converted from a paper version to computerized delivery, but are validated computerized clinical assessments, comprising a computer-delivered interview and scoring algorithm, that can replace the appraisals normally conducted by a clinician.

Factors contributing to failed CNS trials
The main challenge facing clinical trials in many CNS indications is in designing studies that provide the best opportunity to demonstrate true drug efficacy compared with placebo. High placebo response rates in depression and anxiety studies are suspected to be a key reason for failure to detect significant separation of active treatment arms from placebo. In an analysis of new drug applications approved by the FDA between 1985 and 1997, Khan et al reported that only 47.7% of adequate-dose new antidepressant treatment arms demonstrated statistically significant separation from placebo, with 20.5% being supportive and 31.8% equivocal (ie, no significant difference from placebo) [8•]. In a similar analysis of anxiolytic drug approvals by the FDA between 1985 and 2000, it was reported that only 48% of adequate-dose new drug treatment arms demonstrated statistically significant separation from placebo [9•].
These results do not include development programs that may have been unnecessarily halted due to poor results prior to regulatory submission. There may be many causes for the high proportion of failed clinical trials in depression, but baseline disease severity has been identified as a prominent factor [10•]: the greater the severity of the baseline depression, the greater the active treatment effects are, compared with placebo. In this analysis, flexible dosing designs and the inclusion of fewer treatment arms and fewer female patients were also demonstrated to improve treatment-related differences. Patients entering depression clinical trials, and also clinical trials for CNS and other disorders, generally improve on placebo. Optimal study design may minimize placebo response and thereby reduce the likelihood of study failure. Computerized administration of clinical assessments has been suggested as one method for accomplishing this aim [11,12•].

Subjective investigator ratings have formed the basis of primary efficacy endpoints in clinical drug trials in depression, anxiety, schizophrenia and many other CNS disorders. Assessments are made following semi-structured interviews, and clinicians rate the severity of symptoms exhibited by patients on an ordinal scale against a number of individual scale items. Item scores, subscale scores and total scores form the basis of clinical measurements. Gold standard instruments include: (i) HAMD [13], the Montgomery-Åsberg depression rating scale (MADRS) [14], and the long and short forms of the IDS [4] for depression; (ii) HAMA [15] and the Leibowitz social anxiety scale (LSAS) [16] for anxiety; (iii) the Yale-Brown obsessive-compulsive scale (Y-BOCS) [17] for obsessive-compulsive disorder; and (iv) the brief psychiatric rating scale (BPRS) [18], the scale for assessment of negative symptoms (SANS) [19] and the positive and negative symptoms scale (PANSS) [20] for schizophrenia.
These instruments are valuable both in determining baseline disease severity and in measuring treatment-related changes. Inter-rater differences in assessment reliability contribute to measurement error, and consequently to the inability to separate treatment arms. Regulatory guidance requires study sponsors to measure and control the effects of inter-rater variability by conducting pre-study training to standardize ratings [21]. Demitrack et al reported that such training has minimal effects [22•]. In their study, the difference between minimum and maximum total scores among 86 raters, obtained during a training event that consisted of four recorded Hamilton interviews, varied from 14 to 21 points. This variability is significant relative to the range of scores likely in a depressed population, where most patients would be scored between 10 and 35. In addition, the time spent by investigators performing patient interviews, such as for the HAMD, is often far less than the recommended 25 to 30 min; in one report, 35% of interviews were conducted in < 10 min [23]. These limitations can be addressed by validated computer interviews, which deliver completely standardized assessments on each application.

The validity of computerized assessments compared with clinician ratings
There is a wealth of literature reporting the equivalence of electronic delivery of assessment instruments to conventional pen-and-paper versions. However, our focus in this review is the use of computerized clinical assessments as an alternative or adjunct to human assessments (in particular, clinician subjective assessments). Millard and Carver reported a comparison of quality-of-life data collected via a live telephone interview or using an automated interview delivered via IVR [24]. This study employed the SF-12 quality-of-life instrument for 229 individuals suffering from mild back pain.
In their study, patients were randomly assigned to one of the assessment modalities and assessed only once. Despite methodological limitations, the study concluded that both human and computerized approaches were equivalent when assessing the physical component summary scale of the quality-of-life instrument. However, the researchers concluded that patients were less reserved in reporting emotional concerns and mood to a computer compared with a human. This finding concurs with other studies. For example, Kobak et al reported that IVR had a greater sensitivity in detecting alcohol problems [25•], and Turner et al indicated a higher prevalence of reporting of sexual behaviors and drug abuse among adolescents when audio-assisted computer interviewing was used [26•]. There is evidence that, in comparison to face-to-face interviews, patients may exhibit increased honesty when responding to sensitive, emotional or embarrassing questions using computer assessments.

Turning to more complex clinical assessments used in CNS studies, such as the HAMD, MADRS and HAMA, much research has been conducted into translating the semi-structured interviews and item scoring performed by trained clinicians into fully structured computerized interviews with non-linear scoring algorithms to produce equivalent measures. Considerable data have been collected over more than ten years to support the clinical utility of this approach. Significantly, Kobak et al provided a summary of validation evidence for a computer interview version of the HAMD from ten comparative studies performed from 1990 to 1999 [27••]. In the studies reported, computer interviews were delivered using PC and IVR technology. The IVR technology used recorded prompts to question the individual, who then selected response options using the telephone touchtone keypad.
This simple interface is especially convenient, enabling assessment from the home as well as the clinic, and largely overcomes issues of functional illiteracy that might limit the ability to deliver complex clinical assessments via other electronic media. It is not uncommon, for example, for researchers to read out site-based questionnaires to patients unable to understand the written version. There have been few examples of the delivery of complex CNS clinical assessments using any other electronic solution in place of clinicians, and much of the review that follows refers exclusively to instruments delivered via IVR technology.

In their review, Kobak et al reported a high correlation between clinician and computer HAMD scores (n = 1791, r = 0.81, p < 0.001) [27••]. Considering all 17 scale items, internal consistency measures of scale reliability [28] supported the use of the computer-administered HAMD. Across the ten studies, Cronbach's α ranged from 0.60 to 0.91 for the computerized HAMD, compared with a range of -0.41 to 0.91 for clinician ratings. In eight out of ten studies, Cronbach's α for computer rating exceeded that for clinician rating. Test-retest reliability for scores obtained by computer interview was also reported to be in the same range as clinician test-retest correlations. Because of the large number of questions delivered during a computer interview, the computerized assessment approach is likely to be less prone to memory effects artificially inflating correlations between assessments repeated over a short time period than clinician recall of individual item ratings.
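The internal consistency statistic quoted throughout these studies, Cronbach's α, can be computed directly from item-level scores. The following is a minimal pure-Python sketch; the function name and toy data are illustrative and not drawn from the studies cited:

```python
def cronbach_alpha(items):
    """Cronbach's alpha from item-level scores.

    items: a list of k item-score lists, each holding one score per subject
    (eg, the 17 HAMD item scores across all rated patients).
    """
    k = len(items)
    n = len(items[0])

    def var(xs):
        # Sample variance (n - 1 denominator).
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    sum_item_vars = sum(var(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum_item_vars / var(totals))

# Two perfectly consistent items give alpha = 1; unrelated items drive it
# toward 0, or even below, as in the clinician range of -0.41 noted above.
print(round(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]]), 3))  # prints 1.0
```

In practice such calculations would be run per study on the full item-by-subject score matrix; the sketch only makes the formula behind the reported α values concrete.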
In a study comparing clinician and computerized HAMA ratings in patients with a diagnosed anxiety disorder (n = 86) or major depression (n = 128) and healthy controls (n = 78) [29•], the computer-administered HAMA demonstrated good internal consistency (Cronbach's α = 0.90) and test-retest reliability (r = 0.96), and a correlation of 0.92 (p < 0.001) was found between computer and clinician versions. No significant difference between mean HAMA scores obtained by computer or clinician was observed among the patients diagnosed with anxiety disorders. A small significant difference in total scores was detected for the sample overall, but this was not clinically relevant as it was within the order of the test-retest variability. A more recent study of 72 patients, comparing computer (IVR) with clinician HAMA, reported no significant differences in scores obtained using either method (mean scores = 15.53 and 16.13 for IVR and clinician, respectively), and measures of internal consistency and test-retest reliability were similar to those observed in the earlier studies [30]. Additional validity was evidenced by good concordance between IVR HAMA scores and Beck anxiety inventory assessments [31] of the same patients.

More recently, research has been conducted into translating the semi-structured clinician interview and rating using the MADRS for assessment of depression severity into an IVR interview and scoring algorithm [32••]. Depressed subjects (n = 60) were assessed by clinician and IVR in random order, and the two sets of resulting scores were not significantly different. Scale reliability measures, ie internal consistency and test-retest reliability, were comparable between the methods. Although promising, the study was conducted in a relatively small sample of patients, and further research is required to investigate the sensitivity of the IVR MADRS to clinical change over time under different treatment regimens.
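Both the computer-clinician agreement and the test-retest reliability figures above are Pearson correlations of paired total scores. A minimal sketch, with invented data rather than values from the cited studies:

```python
def pearson_r(x, y):
    """Pearson correlation between two paired score lists, eg clinician vs
    computer HAMA totals, or two administrations of the same instrument
    for test-retest reliability."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Perfectly linear agreement gives r = 1; an inverse relationship gives r = -1.
print(pearson_r([1, 2, 3], [2, 4, 6]))  # prints 1.0
```

Correlation alone does not rule out a constant offset between raters, which is why the studies above also compare mean scores between methods.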
Other studies have examined and demonstrated good equivalence between electronic- and clinician-rated scores for the Y-BOCS [33] and LSAS [34]. These instruments, however, are essentially checklists, which makes the equivalence between clinician and self-report less surprising. The different types of assessment relevant to CNS studies that have been adapted and used via IVR are listed in Table 1.

Validated computer assessments may offer many advantages to researchers. For example, the standardized interview delivered via computer eliminates the inter-rater variability observed in multicenter clinical trials. As discussed later in this review, it can also eliminate the sources of rater bias, such as inclusion bias and bias introduced by side-effect profile unblinding. The ability to perform clinical assessments from home rather than in the clinic, using remote electronic solutions such as IVR or PDA, provides many benefits to study design. Efficacy data may be collected more frequently to investigate, for example, speed of onset, and data may be collected in study phases in which regular clinical assessments are not convenient, such as during long-term safety extensions.

Table 1. Assessments that have IVR adaptations relevant to CNS clinical trials.
• Brief social phobia scale (BSPS)
• Changes in sexual functioning questionnaire (CSFQ)
• Cognitive function test battery (CFTB; Cognitive Drug Research Ltd/ClinPhone Group Ltd)
• Daily telephone assessment (DTA), for depression
• Davidson trauma scale (DTS©)
• HAMA
• HAMD
• LSAS
• MADRS
• Memory-enhanced retrospective evaluation of treatment (MERET®)
• Mental health screener (MHS)
• Quality-of-life enjoyment and satisfaction questionnaire (Q-LES-Q)
• SF-12, SF-20 and SF-36 quality-of-life scales
• Symptoms of dementia screener (SDS)
• Telephonic remote evaluation of neuropsychological deficits (TREND)
• Work and social adjustment scale (WSAS)
• Y-BOCS

A final aspect of the validity of these techniques involves assessment of whether these computer-administered ratings are practical for use in clinical trials. It should be noted that they in no way replace the standard requirements for a clinician-patient meeting. Computer-administered ratings can, however, reduce the time required for lengthy rating interviews, permitting more time for clinicians to focus on the care and treatment of the patient. A study of social anxiety disorder patients (n = 874) using IVR versions of the HAMD and LSAS [35] reported that 90% of patients rated computer assessment as 'very easy' or 'easy' [12•]. Similar results were reported in another study using IVR HAMD assessment (Figure 1), in which 76% of patients (n = 74) agreed that the questions were clearly stated and understandable (12% had no opinion), 93% agreed that the assessment was easy to complete (4% had no opinion), 79% found the IVR assessments convenient (12% had no opinion), and 48% indicated that the computer assessment allowed them to express their true feelings (30% had no opinion) [36•].
Baseline assessment of disease severity
Most CNS clinical trials using clinician-assessed subjective rating scales as endpoints require patients to have a baseline rating above or below a specified threshold to ensure that only those of an appropriate disease severity are included in the study. For example, depression studies using the HAMD endpoint commonly require a baseline HAMD score ≥ 20 as an inclusion criterion. Due to the considerable enrollment pressure on clinicians, and the fact that compensation is normally contingent upon randomization, conscious or subconscious bias can be introduced into baseline ratings to satisfy inclusion criteria. Inflation of scores at baseline introduces treatment-independent improvements post-baseline, which contribute to the placebo effect and exacerbate the problem of demonstrating treatment-related differences.

Figure 1. Patient acceptability ratings using IVR HAMD in depressed patients. [Bar chart of the percentage of patients responding strongly disagree, disagree, no opinion, agree or strongly agree that the questions were understandable, that the assessment was easy to complete and convenient, and that it allowed them to express their true feelings.] This figure is based on data from reference [36•].

A number of studies presented at scientific meetings have implicated this phenomenon by comparing ratings performed by clinicians with those produced by computer assessment. In one such study, reported by DeBrota et al (Figure 2) [37••], the researchers not only compared clinician HAMD ratings with computer scores collected using IVR at each of 8 weekly treatment visits, but also blinded the investigators to the true start of double-blind treatment by including a variable-duration phase of placebo treatment after randomization.
At baseline, the researchers observed that the distribution of clinician scores was not normal and was truncated at the inclusion threshold score of 20. In fact, only four of the 291 patient assessments (1.4%) made by clinicians at week 0 were below 20. In contrast, the distribution of computerized assessment scores at this time point was normal, with 38% of patients rated below 20. Clinician ratings before randomization were, on average, 2.2 points higher than those obtained via IVR (p < 0.001). Following randomization, the concordance between clinician and computer ratings returned: during the final four visits after treatment assignment, the correlation between methods was 0.84 (p < 0.01) and the mean clinician scores were not significantly different from the IVR-derived scores (they were 0.31 points higher). The large differences in the distributions of ratings at baseline and the convergence of assessment methods following randomization are suggestive of inflated clinician ratings of disease severity at baseline.

Figure 2. Concordance of IVR and clinician HAMD ratings pre- and post-baseline for a study requiring baseline HAMD scores ≥ 20.
[Distributions and concordance of HAMD17 scores rated by patient self-assessment via IVR (p) and by clinician (c) at visit 1 (screening), visit 2 (perceived baseline), visit 3 (earliest possible baseline) and visit 9 (endpoint), spanning the run-in, perceived randomization and earliest true randomization phases, for the placebo and active arms.]

To qualify for this study, patients required a clinician-assessed HAMD score of ≥ 20 at visits 1 and 2. It was observed that clinician scores were poorly correlated with IVR scores at these visits. Moreover, the distribution of clinician scores at visits 1 and 2 was truncated at a score of 20, illustrating the inflation of baseline scores to meet the study inclusion criteria. Concordance of IVR and clinician assessments was re-established at later visits where study eligibility did not depend upon the HAMD ratings. This figure is based on data from reference [37••]. p: patient self-assessment (IVR); c: clinician assessment.
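The comparison underlying this kind of analysis can be sketched simply: given paired baseline scores from the two rating methods, summarize what fraction of each distribution falls below the entry threshold and the mean clinician-minus-IVR difference. The function and toy data below are invented for illustration, not taken from the cited studies:

```python
def inclusion_bias_summary(clinician, ivr, threshold):
    """Summarize signs of baseline score inflation around an entry threshold.

    A clinician distribution truncated at the threshold (almost no scores
    below it), alongside a roughly normal computer-rated distribution and a
    positive mean clinician-minus-IVR difference, is the pattern reported
    for inflated baseline ratings.
    """
    n = len(clinician)
    return {
        "frac_below_clinician": sum(s < threshold for s in clinician) / n,
        "frac_below_ivr": sum(s < threshold for s in ivr) / n,
        "mean_clinician_minus_ivr": sum(c - i for c, i in zip(clinician, ivr)) / n,
    }

# Toy example: every clinician score clears the threshold of 20, while the
# paired IVR scores for the same patients scatter on both sides of it.
summary = inclusion_bias_summary([20, 21, 22, 20], [18, 19, 23, 17], 20)
```

A real analysis would add a paired significance test and inspect the full score histograms, as DeBrota et al did; the sketch only captures the headline quantities.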
In a second study, Rayamajhi et al compared clinician and IVR HAMD assessments obtained from 307 patients who participated in a double-blind placebo-controlled study for major depressive disorder [38••]. A lower baseline minimum score of 15 was required by this study, and clinician HAMD scores were only an average of 1.09 points higher than IVR assessments at screening/baseline, with the agreement between assessment methods improving following randomization.

In a third study, IVR HAMD assessments were used to prequalify individuals prior to clinician assessment in a study of major depressive disorder [Mundt J, unpublished results]. Patients were qualified for clinician assessment if they were rated with an IVR HAMD rating of ≥ 20 at both screening and baseline. To enter the study, patients also required a subsequent baseline clinician rating of ≥ 22. Patients (n = 582) were qualified by IVR assessment and subsequently rated by the clinicians. Overall, a total of 1593 IVR HAMDs (1011 at screening visits, 582 at baseline) were performed. Of all rated patients with HAMD scores ≥ 20, 188 had computer HAMD scores of 20 or 21. Of these, only one patient was rated below 22 by clinicians and hence disqualified from entering the study. Interestingly, clinicians were not informed that the threshold for qualification using the IVR assessments was lower than the clinician rating inclusion criterion. The fact that only one marginal IVR-rated patient (scored 20 to 21) was excluded by clinicians could be evidence of confirmation bias among raters. However, it could be argued that the IVR HAMD simply underestimates disease severity, which is more accurately assessed by the clinician.
The data from the two previous studies indicated lower IVR ratings at screening and baseline before concordance was restored between clinician and IVR ratings at subsequent assessments where inclusion/exclusion did not depend upon the ratings [37••,38••]. However, in the third study the mean IVR rating of qualified patients at baseline was significantly higher than that obtained via clinician rating (26.8 versus 25.4, respectively, p < 0.001), despite the use of a lower IVR qualification threshold, indicating that IVR does not systematically underestimate depression severity.

An opposite effect to this observation of baseline score inflation among clinician ratings in studies requiring a minimum score for study entry has also been reported: inclusion bias has been observed in studies requiring patients to present with a baseline score no greater than a defined threshold. A recent placebo-controlled pilot study of St John's Wort for the treatment of social phobia required a HAMD score ≤ 16 for inclusion into the study [39•]. Of the 40 patients that were assessed at baseline, none were excluded based upon clinician-assessed HAMD. However, self-report HAMD data (obtained by paper questionnaire [40]) indicated that 16 out of 40 (40%) patients had a self-report HAMD score > 16, with self-report scores being an average of 5.05 points higher than clinician ratings (p < 0.001).

This phenomenon of inclusion bias is not confined to depression ratings. A study of 624 social anxiety disorder patients was presented at the 2002 American College of Neuropsychopharmacology meeting [41••]. In this study, patients required a clinician-rated HAMA score of ≥ 20 to enter the open-label (active) 6-week run-in phase. Responders at week 6, defined as achieving a HAMA score ≤ 11 and exhibiting a minimum change from baseline HAMA of 50%, were then included in the subsequent 6-month double-blind relapse prevention phase.
By comparison of IVR and clinician-rated HAMA scores, the researchers reported inflation of clinician scores at baseline to meet the entry criterion of HAMA score ≥ 20, and deflation of clinician scores at the end of the open-label phase to meet the continuation criterion of HAMA score ≤ 11. Clinician scores were, on average, 2.7 points higher at baseline and 2.9 points lower at the end of the open-label phase compared with computerized assessments (p < 0.05).

Sensitivity to detection of treatment effects
In addition to the elimination of inclusion bias, computer-administered ratings have been associated with the measurement of smaller changes from baseline and larger treatment effects relative to placebo, compared with those observed using clinician ratings. The smaller changes from baseline can be explained by the elimination of the baseline score inflation/deflation observed among clinician ratings, but the increased sensitivity to demonstrate treatment separation is a significant finding that may benefit future study design. Rayamajhi et al reported the effects of 8 weeks of treatment with duloxetine, fluoxetine or placebo in 307 major depressive disorder patients (Figure 3) [38••]. Clinician-rated baseline adjusted least-square mean changes from baseline to endpoint HAMD were -5.99, -8.11 and -6.59 for placebo, duloxetine and fluoxetine, respectively. The corresponding mean changes from baseline calculated using IVR HAMDs were -4.04, -6.29 and -5.84, respectively. As discussed above, the lower changes from baseline within each treatment group may be associated with clinician score inflation at baseline. Petkova et al proposed the simultaneous use of both patient self-ratings and clinician ratings in non-psychotic and non-demented populations, so that the similarity between clinician ratings and patient self-ratings can provide a measure of bias in clinician ratings [42•].
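The "baseline adjusted least-square mean changes" above come from an analysis of covariance: regressing change from baseline on the baseline score plus a treatment indicator. A minimal pure-Python sketch of that adjustment for one active arm versus placebo follows; the data and function names are illustrative, and the actual trial analyses [38••] would have used full statistical software with significance testing:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def ancova_effect(change, baseline, treat):
    """Fit change = b0 + b1 * baseline + b2 * treat by ordinary least squares
    via the normal equations; b2 is the baseline-adjusted treatment-placebo
    difference in mean change."""
    X = [[1.0, b, float(t)] for b, t in zip(baseline, treat)]
    n, k = len(X), 3
    XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)] for r in range(k)]
    Xty = [sum(X[i][r] * change[i] for i in range(n)) for r in range(k)]
    return solve(XtX, Xty)


# Synthetic data generated as change = -4 - 0.1 * baseline - 2 * treat,
# so the fit recovers a baseline-adjusted treatment effect of about -2.
coefs = ancova_effect(
    change=[-5.8, -6.0, -6.2, -6.4, -7.9, -8.1, -8.3, -8.5],
    baseline=[18, 20, 22, 24, 19, 21, 23, 25],
    treat=[0, 0, 0, 0, 1, 1, 1, 1],
)
```

Adjusting for baseline in this way removes the treatment-independent component of change introduced by inflated entry scores, which is precisely why the clinician- and IVR-rated analyses above can yield different relative treatment effects.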
Despite the smaller changes from baseline, analysis of IVR-rated assessments in this study gave evidence of increased relative treatment effects and greater power to demonstrate treatment separation. The analysis of both clinician- and IVR-rated data highlighted significant improvements in depression severity attributable to duloxetine compared with placebo: baseline adjusted least-square mean changes from baseline compared to placebo were -2.12 (p = 0.008) and -2.25 (p = 0.006) for clinician and IVR ratings, respectively. For fluoxetine-treated patients, however, clinician ratings showed little evidence of improvement relative to placebo, whereas the IVR ratings indicated a much larger improvement that approached statistical significance: baseline adjusted least-square mean changes from baseline compared to placebo were -0.6 (p = 0.536) and -1.8 (p = 0.076) for clinician and IVR ratings, respectively. The fact that the fluoxetine arm was underpowered in this study, having half as many patients as the duloxetine arm, makes this result all the more impressive.

Figure 3. Baseline adjusted least-square mean changes from baseline HAMD. [Bar charts comparing clinician-rated and IVR patient-rated baseline adjusted least-square mean changes for each treatment group (panel A) and relative to placebo (panel B), using the values quoted above.] (A) Baseline adjusted least-square mean changes from baseline HAMD following 8 weeks of treatment with placebo, duloxetine or fluoxetine.
(B) Corresponding treatment changes relative to placebo, in a study of 307 patients with major depressive disorder [38••].

This is not an isolated result. For example, similar indications of increased sensitivity to treatment-related effects have been reported using the IVR HAMA clinical assessment [41••].

Conclusions
The data reported in the literature and at scientific meetings over the past decade, predominantly in the last five years, have demonstrated the potential of using computerized assessments in clinical trials. More significantly, they provide evidence of the equivalence between computer-delivered patient interviews and ratings and the corresponding clinician-delivered versions. Examples cited in this review include the HAMD, HAMA and MADRS instruments commonly used in depression and anxiety clinical trials. Such evidence has been pivotal in recent FDA acceptance of the IVR HAMD as a primary endpoint in depression outpatient trials. The studies reviewed here have highlighted the advantages and utility of computerized clinical assessments, in particular the elimination of the inclusion bias observed when subjective clinician ratings form the basis of patient eligibility. In addition, the reduction of inter-rater variability and the reduced placebo effect due to elimination of baseline score inflation give computerized clinical assessments increased sensitivity to detect treatment-related differences from placebo. Because of the success in developing and validating equivalent measures of severity using computerized clinical assessments, approaches for taking greater advantage of the unique capabilities of computer-administered self-assessment continue to be developed. One development that has provided favorable results is the use of assessments to measure the speed of onset of CNS drugs.
Assessments commonly used in depression studies, such as the HAMD and MADRS, are designed for weekly use, but recent research suggests that patients receiving newer antidepressants may exhibit symptom improvement before the end of the first week of treatment [43]. Computerized assessments that can be self-completed at home using electronic diary solutions facilitate the reliable collection of such data, and can do so with minimal reporting burden for patients [44]. For example, in a 12-week open-label trial using daily reports during the first week of treatment, 137 depressed patients randomized to one of two doses rated their pain symptoms, and their emotional and physical improvements, using an IVR system. Results demonstrated that patients receiving the higher dose reported significant symptom improvement as early as one day after treatment [45••]. The daily telephone assessment (DTA©) applies the same approach of daily patient self-report to the severity of core depressive symptoms, which might provide better precision in determining therapeutic onset of action [46]. Further research is required to develop computerized self-assessment instruments that can measure speed of onset of efficacy and adverse events for use in other indications.

Another area in which telephone-based clinical assessments are showing promise is self-assessment of improvement. Although retrospective assessments of change can be more sensitive than serial assessments made at multiple time points [47], retrospective recall is problematic because memory is largely a reconstructive process and the accuracy of reconstruction is profoundly affected by the passage of time. Memory enhanced retrospective evaluation of treatment (MERET®) uses an IVR interface to permit patients enrolled in clinical studies to describe feelings and experiences related to their condition at baseline in their own voice and words.
These descriptions are recorded and then played back to patients prior to making subsequent retrospective reports concerning change after treatment. Preliminary results with the technique have been encouraging [48•], suggesting that the process may increase apparent treatment effects relative to current methods for obtaining patient global impressions of improvement.

A final area of ongoing research involves the capture and acoustical analysis of patients' speech patterns. Recent studies of voice acoustical analysis of patients with extremely early-stage Parkinson's disease have produced exciting findings [49], and pilot research with depressed patients also appears promising [50]. A preliminary methodology study comparing acoustical measures made from recordings obtained using state-of-the-art laboratory recording equipment with simultaneous recordings made over the telephone using an IVR system was also encouraging, indicating that the data obtained by the two methods were highly comparable [51]. This opens the possibility of using such inexpensive techniques in large-scale clinical trials. Ongoing research, currently funded by the National Institute of Mental Health, will continue to explore and develop these techniques. These future directions will complement the existing achievements in this area to ensure that computerized clinical assessment, in particular using IVR, makes a valuable contribution to current CNS clinical trials.

References
• of special interest
•• of outstanding interest

1. Note for guidance on clinical investigation of medicinal products in the treatment of asthma: The European Medicines Agency, London, UK (2002). www.emea.eu.int/pdfs/human/ewp/292201en.pdf
2. Note for guidance on clinical investigation of steroid contraceptives in women: The European Medicines Agency, London, UK (2000). www.emea.eu.int/pdfs/human/ewp/051998en.pdf
3. Byrom B, Stein D, Greist J: A hotline to better patient data. Good Clin Practices J (2005) 12(2):12-15.
•• Review containing full details of the FDA statement regarding IVR HAMD acceptability.
4. Rush AJ, Gullion CM, Basco MR, Jarrett RB, Trivedi MH: The inventory of depressive symptomatology (IDS): Psychometric properties. Psychol Med (1996) 26(3):477-486.
5. Verschelden P, Cartier A, L'Archeveque J, Trudeau C, Malo JL: Compliance with and accuracy of daily self-assessment of peak expiratory flows (PEF) in asthmatic subjects over a three month period. Eur Respir J (1996) 9(5):880-885.
6. Ryan JM, Corry JR, Attewell R, Smithson MJ: A comparison of an electronic version of the SF-36 general health questionnaire to the standard paper version. Qual Life Res (2002) 11(1):19-26.
7. Stone AA, Shiffman S, Schwartz JE, Broderick JE, Hufford MR: Patient non-compliance with paper diaries. Br Med J (2002) 324(7347):1193-1194.
• Study illustrating the limitations of paper diaries, with particular regard to retrospective and prospective completion.
8. Khan A, Leventhal RM, Khan SR, Brown WA: Severity of depression and response to antidepressants and placebo: An analysis of the Food and Drug Administration database. J Clin Psychopharmacol (2002) 22(1):40-45.
• Meta-analysis identifying that > 50% of new antidepressant treatment arms fail to show statistical significance compared with placebo.
9. Khan A, Khan S, Brown WA: Are placebo controls necessary to test new antidepressants and anxiolytics? Int J Neuropsychopharmacol (2002) 5(3):193-197.
• Meta-analysis identifying that > 50% of new anxiolytic treatment arms fail to show statistical significance compared with placebo.
10. Khan A, Kolts RL, Thase ME, Krishnan KR, Brown W: Research design features and patient characteristics associated with the outcome of antidepressant clinical trials. Am J Psychiatry (2004) 161(11):2045-2049.
• Factor analysis identifying baseline disease severity as the major factor in antidepressant study failure.
The paper also cites flexible dosing regimens, number of treatment arms and proportion of female patients as important factors.
11. Kobak KA, Greist JH, Jefferson JW, Katzelnick DJ, Mundt JC: New technologies to improve clinical trials. J Clin Psychopharmacol (2001) 21(3):255-256.
12. Greist JH, Mundt JC, Kobak K: Factors contributing to failed trials of new agents: Can technology prevent some problems? J Clin Psychiatry (2002) 63(Suppl 2):8-13.
• Useful review of factors contributing to failed CNS trials and patient acceptability of IVR computerized assessments.
13. Hamilton M: A rating scale for depression. J Neurol Neurosurg Psychiatry (1960) 23:56-62.
14. Montgomery SA, Åsberg M: A new depression scale designed to be sensitive to change. Br J Psychiatry (1979) 134:382-389.
15. Hamilton M: The assessment of anxiety states by rating. Br J Med Psychol (1959) 32(1):50-55.
16. Liebowitz MR: Social phobia. Mod Probl Pharmacopsychiatry (1987) 22:141-173.
17. Goodman WK, Price LH, Rasmussen SA, Mazure C, Fleischmann RL, Hill CL, Heninger GR, Charney DS: The Yale-Brown obsessive compulsive scale: I. Development, use and reliability. Arch Gen Psychiatry (1989) 46(11):1006-1011.
18. Overall JE, Gorham DR: The brief psychiatric rating scale. Psychol Rep (1962) 10:799-812.
19. Andreasen NC: The scale for the assessment of negative symptoms (SANS): Conceptual and theoretical foundations. Br J Psychiatry Suppl (1989) (7):49-58.
20. Kay SR, Fiszbein A, Opler LA: The positive and negative syndrome scale (PANSS) for schizophrenia. Schizophrenia Bull (1987) 13(2):261-276.
21. Note for guidance on clinical investigation of medicinal products in the treatment of depression: The European Medicines Agency, London, UK (2002). www.emea.eu.int/pdfs/human/ewp/056798en.pdf
22. Demitrack MA, Faries D, DeBrota D, Potter WZ: The problem of measurement error in multisite clinical trials. Psychopharm Bull (1997) 33:513.
• Short article describing the inter-rater variation in HAMD scores following an investigator training event.
23. Feiger A, Engelhardt N, DeBrota D, Cogger K, Lipsitz J, Sikich D, Kobak K: Rating the raters: An evaluation of audio-taped Hamilton depression rating scale (HAMD) interviews. 43rd Annual Meeting of the New Clinical Drug Evaluation Unit Program, Boca Raton, FL, USA (2003).
24. Millard RW, Carver JR: Cross-sectional comparison of live and interactive voice recognition administration of the SF-12 health status survey. Am J Manag Care (1999) 5(2):153-159.
25. Kobak KA, Taylor LH, Dottl SL, Greist JH, Jefferson JW, Burroughs D, Mantle JM, Katzelnick DJ, Norton R, Henk HJ, Serlin RC: A computer-administered telephone interview to identify mental disorders. J Am Med Assoc (1997) 278(10):905-910.
• Describes computerized diagnostic assessments for psychiatric disorders. Patients demonstrated increased honesty to sensitive questions using computerized (IVR) assessment, leading to a higher prevalence of diagnosis of alcohol/drug abuse.
26. Turner CF, Ku L, Rogers SM, Lindberg LD, Pleck JH, Sonenstein FL: Adolescent sexual behavior, drug use and violence: Increased reporting with computer survey technology. Science (1998) 280(5365):867-873.
• Computer-assisted interviews provide increased reporting of sensitive issues including injection drug use and sexual contact.
27. Kobak KA, Mundt JC, Greist JH, Katzelnick DJ, Jefferson JW: Computer assessment of depression: Automating the Hamilton depression rating scale. Drug Info J (2000) 34(1):145-156.
•• Valuable review and meta-analysis of validation studies comparing clinician- and IVR-administered HAMDs.
28. Cronbach LJ: Coefficient α and the internal structure of tests. Psychometrika (1951) 16(3):297-334.
29. Kobak KA, Reynolds WM, Greist JH: Development and validation of a computer-administered version of the Hamilton anxiety scale. Psychol Assess (1993) 5:487-492.
• Evidence for the concordance between clinician- and IVR-administered HAMA ratings.
30. Kobak KA, Greist JH, Jefferson JW, Katzelnick DJ, Greist R: Validation of a computerized version of the Hamilton anxiety scale administered over the telephone using interactive voice response. American Psychiatric Association 151st Annual Meeting, Toronto, Canada (1998).
31. Beck AT, Epstein N, Brown G, Steer RA: An inventory for measuring clinical anxiety: Psychometric properties. J Consult Clin Psychol (1988) 56(6):893-897.
32. Katzelnick DJ, Mundt JC, Kennedy S, Eisnfeld B, Greist JH: Development and validation of a computer-administered version of the Montgomery-Åsberg rating scale (MADRS) using interactive voice response (IVR) technology. J Psychiatr Res (2005): manuscript submitted.
•• Results of a pilot study highlighting the concordance between clinician- and IVR-administered MADRS ratings.
33. Kobak KA, Greist JH, Jefferson JW, Katzelnick DJ, Schaettle SC: Computerized assessment in clinical drug trials: A review. 37th Annual Meeting of the New Clinical Drug Evaluation Unit Program, Boca Raton, FL, USA (1997).
34. Kobak KA, Greist JH, Jefferson JW, Katzelnick DJ: Validation of a computer-administered version of the Liebowitz social anxiety scale administered by telephone via interactive voice response (IVR). 42nd Annual Meeting of the New Clinical Drug Evaluation Unit Program, Boca Raton, FL, USA (2002).
35. Katzelnick DJ, Kobak KA, DeLeire T, Henk HJ, Greist JH, Davidson JR, Schneider FR, Stein MB, Helstad CP: Impact of generalized social anxiety disorder in managed care. Am J Psychiatry (2001) 158(12):1999-2007.
36. Ewing H, Reesal R, Kobak KA, Greist JH: Patient satisfaction with computerized assessment in a multicenter clinical trial. 38th Annual Meeting of the New Clinical Drug Evaluation Unit Program, Boca Raton, FL, USA (1998).
• Good patient acceptability data using IVR clinical assessments.
37.
DeBrota DJ, Demitrack MA, Landin R, Kobak KA, Greist JH, Potter WZ: A comparison between interactive voice response system-administered HAMD and clinician-administered HAMD in patients with major depressive episode. 39th Annual Meeting of the New Clinical Drug Evaluation Unit Program, Boca Raton, FL, USA (1999).
•• Study providing strong evidence of baseline score inflation with clinician-administered HAMD ratings.
38. Rayamajhi J, Lu Y, DeBrota D, Greist J, Demitrack M: A comparison between interactive voice response system and clinician administration of the Hamilton depression rating scale. 42nd Annual Meeting of the New Clinical Drug Evaluation Unit Program, Boca Raton, FL, USA (2002).
•• Report from a study highlighting baseline score inflation and increased sensitivity to detecting treatment effects using the IVR-administered HAMD compared with clinician-administered ratings.
39. Kobak KA: Saint John's Wort vs placebo in social phobia: Results from a placebo-controlled pilot study. NIMH Grant 1R21AT00502, NCCAM, Dean Foundation, Madison, WI, USA (2004).
• Report from a study demonstrating inclusion bias when a maximum HAMD score is required for study entry (baseline score deflation).
40. Reynolds WM, Kobak K: Reliability and validity of the Hamilton depression inventory: A paper-and-pencil version of the Hamilton depression rating scale clinical interview. Psychol Assess (1995) 7:472-483.
41. Feltner DE, Kavoussi R, Haber H, Kobak KA, Greist JH, Pande AC: Evaluation of an interactive voice response HAMA in a GAD relapse prevention trial. American College of Neuropsychopharmacology 41st Annual Meeting, San Juan, Puerto Rico (2002).
•• A study demonstrating baseline score inflation with clinician-administered HAMA, and improved sensitivity to detect treatment effects using computerized assessment.
42. Petkova E, Quitkin FM, McGrath PJ, Stewart JW, Klein DF: A method to quantify rater bias in antidepressant trials. Neuropsychopharmacology (2000) 22(6):559-565.
• Article suggesting that self-assessment should be used as a measure of clinician bias.
43. Brannan SK, Mallinckrodt CH, Detke MJ, Watkin JG, Tollefson GD: Onset of action for duloxetine 60 mg once daily: Double-blind, placebo-controlled studies. J Psychiatr Res (2005) 39(2):161-172.
44. Mundt JC: Interactive voice response (IVR) systems in clinical research and treatment. Psychiatric Serv (1997) 48(5):611-612.
45. Moore HK, Woolreich MM, Wilson MG, Mundt JC, Fava M, Mallinckrodt CH, Arnold L, Greist JH: A novel approach to measuring onset of efficacy of duloxetine in mood and painful symptoms of depression. American College of Neuropsychopharmacology 43rd Annual Meeting, San Juan, Puerto Rico (2004).
•• Study identifying rapid onset treatment effects using daily IVR assessments.
46. DeBrota DJ, Engelhardt N, Mundt JC, Greist JH, Verfaille SJ: Daily assessment of depression severity with an interactive voice response system (IVRS). 43rd Annual Meeting of the New Clinical Drug Evaluation Unit Program, Boca Raton, FL, USA (2003).
47. Fischer D, Stewart AL, Bloch DA, Lorig K, Laurent D, Holman H: Capturing the patient's view of change as a clinical outcome measure. J Am Med Assoc (1999) 282(12):1157-1162.
48. Mundt JC, Brannan S, DeBrota DJ, Detke MJ, Greist JH: Memory enhanced retrospective evaluation of treatment: MERET. 43rd Annual Meeting of the New Clinical Drug Evaluation Unit Program, Boca Raton, FL, USA (2003).
• A study showing increased sensitivity of patient impression of improvement measures when using a telephone-based assessment with audio replay of a self-recorded anchor message describing baseline condition.
49. Harel B, Cannizzaro M, Snyder PJ: Variability in fundamental frequency during speech in prodromal and incipient Parkinson's disease: A longitudinal case study.
Brain Cogn (2004) 56(1):24-29.
50. Cannizzaro M, Harel B, Reilly N, Chappell P, Snyder PJ: Voice acoustical measurement of the severity of major depression. Brain Cogn (2004) 56(1):30-35.
51. Cannizzaro MS, Reilly N, Mundt JC, Snyder PJ: Remote capture of human voice acoustical data by telephone: A methods study. Clin Linguistics Phonetics (2005): in press.