University of Western Sydney
School of Computing and Mathematics

A Framework for an Intelligent Decision Support System (IDSS), Including a Data Mining Methodology, for Fetal-Maternal Clinical Practice and Research

By Jennifer Heath

A dissertation submitted in fulfilment of the requirements of Master of Science (Hons)

November, 2006

CERTIFICATION

I, Jennifer Heath, certify that this thesis, submitted to address the requirements for the Award of Master of Science (Honours), in the School of Computing and Mathematics, University of Western Sydney, is wholly my own work unless otherwise referenced or acknowledged. This document has not been submitted at any other Academic institution to meet requirements for any Award.

16th November 2006
Jennifer Heath

TABLE OF CONTENTS

CERTIFICATION
ACKNOWLEDGEMENTS
PUBLICATIONS RELATED TO DISSERTATION
ABSTRACT
CHAPTER 1 INTRODUCTION
  1.1 INTRODUCTION
  1.2 RESEARCH AIMS AND OBJECTIVES
  1.3 RESEARCH RATIONALE
    1.3.1 Knowledge Discovery in Data (KDD) Approaches for Clinical Research
    1.3.2 Limitations of Existing OnLine Transaction Processing Systems for the Fetal-Maternal Domain
    1.3.3 Slow Uptake of IDSS in the Fetal-Maternal Medical Domain
  1.4 RESEARCH APPROACH
  1.5 CONTRIBUTIONS TO KNOWLEDGE
  1.6 THESIS OVERVIEW
CHAPTER 2 LITERATURE REVIEW
  2.1 INTRODUCTION
  2.2 KNOWLEDGE DISCOVERY IN DATA (KDD)
    2.2.1 General Process of KDD
    2.2.2 Data Mining
    2.2.3 Exploratory/Descriptive Data Mining
    2.2.4 Confirmatory/Predictive Data Mining
    2.2.5 Classification and Prediction
    2.2.6 CRISP-DM
    2.2.7 Critics
    2.2.8 Data Mining in a Medical Domain
    2.2.9 Conclusions and Implications for this Research
  2.3 INTELLIGENT DECISION SUPPORT SYSTEMS (IDSS)
    2.3.1 Evolution of Intelligent Decision Support Systems
    2.3.2 IDSS Components
      2.3.2.1 Extraction-Transformation-Loading (ETL) Frameworks
      2.3.2.2 Data Warehouse Architectures
      2.3.2.3 Knowledge Discovery in Data / Data Mining Frameworks
      2.3.2.4 Knowledge Base Architectures
      2.3.2.5 Model Base Architectures
    2.3.3 Conclusions and Implications for this Research
  2.4 IDSS for Clinical Practice and Research
    2.5.1 IDSS for the medical domain
    2.4.2 Unrealized Potential for IDSS in the Medical Domain
    2.4.3 DxPlain – a medical IDSS from the United States of America
    2.4.4 Medical IDSS in the Australian Context
    2.4.5 Conclusions and Implications for this Research
  2.5 EXPERIMENT DESIGN AND CLINICAL TRIALS
    2.6.1 Dominant Medical Research Paradigm
    2.6.2 Clinical Reasoning, Statistical Reasoning and KDD
    2.6.3 Conclusions and Implications for Research
CHAPTER 3 CROSS INDUSTRY STANDARD PROCESS – DATA MINING (CRISP-DM) SPECIALIZED TASKS FOR MEDICAL RESEARCH
  3.1 Introduction
  3.2 Extended CRISP-DM enhancing Data Management Layer of IDSS
  3.3 ‘Outputs’ from Data Mining are ‘Inputs’ to the Knowledge Base of IDSS
  3.4 CRISP-DM Specialised Tasks to Support Medical Research
    3.4.1 Initial Phase
    3.4.2 Data Preparation Phase
    3.4.3 Testing Phase
    3.4.4 Assessment Phase
    3.4.5 Usage Phase
    3.4.6 The role of exploratory and confirmatory data mining
  3.5 Generation of Electronic ‘Rules’ for use in IDSS knowledge base
CHAPTER 4 INTELLIGENT DECISION SUPPORT SYSTEMS (IDSS) FOR CLINICAL PRACTICE AND RESEARCH
  4.1 Introduction
  4.2 Extraction-Transformation and Loading (ETL) Frameworks
  4.3 Data Warehousing Architectures
    4.3.1 Core Dimensions for IDSS
    4.3.2 Domain Ontology
  4.4 Data Mining / Knowledge Discovery in Data Frameworks
  4.5 Knowledge Base Architectures
  4.6 Model Base Architectures
CHAPTER 5 ‘DATABABES’ CASE STUDY
  5.1 Introduction
  5.2 Aim
  5.3 Methods
  5.4 Results
  5.5 Conclusions
CHAPTER 6 CONCLUSION
  6.1 Contribution to Knowledge
  6.2 Future Research
  6.3 Conclusion
REFERENCES
APPENDIX A

FIGURES

Figure 1.1 Susman and Evered’s Action Research Cycle
Figure 2.1 ‘We are data rich, but information poor’
Figure 2.2 ‘Searching for knowledge (interesting patterns) in your data’
Figure 2.3 Data mining as a step in the process of knowledge discovery
Figure 2.4 Fields contributing to data mining
Figure 2.5 Concept hierarchy for attribute age
Figure 2.6 CRISP-DM phases
Figure 2.7 Four level breakdown of CRISP-DM methodology
Figure 2.8 Turban and Aronson’s IDSS
Figure 2.9 Heath and McGregor’s five zones of interest
Figure 2.10 A general model of an EDSS
Figure 2.11 The Scientific Method
Figure 2.12 Continuum of Experiment Design
Figure 2.13 Convergence of Clinical and Statistical Reasoning Paradigm
Figure 2.14 Addition of KDD to Reasoning paradigm
Figure 3.1 Framework for IDSS, data management highlighted
Figure 3.2 Framework for IDSS, knowledge base highlighted
Figure 3.3 Parallelism between CRISP-DM and the Scientific Method
Figure 3.4 CRISP-DM extended for Clinical Practice and Research
Figure 4.1 Proposed IDSS Framework
Figure 5.1 ‘DataBabes’ components
Figure 5.2 Chorionic Villus Sampling star schema

TABLES

Table 2.1 Data mining operations and techniques
Table 2.2 Common type of DSS support
Table 2.3 Features of Clinical Decision Support Systems (CDSS) important for CDSS effectiveness
Table 4.1 Datatype coding

ACKNOWLEDGEMENTS

I must begin by acknowledging the invaluable guidance provided to me by my Research Supervisor Dr Carolyn McGregor. Wrestling with the entire research process was frustrating at times and Dr McGregor never once faltered in her patience and quiet encouragement. I was fortunate in that Dr McGregor offered both industry experience in the fields of data warehousing and intelligent decision support systems and current, successful research experience in Health Informatics. I lost count of how many times I baulked at submitting abstracts, making presentations, applying for grants only to have Dr McGregor listen to all my doubts, reject them all and have me continue on over the next research ‘hurdle’.

Dr Liwan Liyanage joined as my co-supervisor towards the end of this research and I also thank her for providing encouragement and insight.

Associate Professor John Smoleniec and Susan Heath, from the Fetal-maternal Unit at Liverpool Hospital have provided enthusiasm and plenty of research support. Specifically they assisted in gaining ethics approvals, research paper production and answering countless questions regarding the data found in the Fetal-maternal domain. There is no way this research would have been completed without Sue and John’s constant support and cooperation.

Last, but by no means least, I thank my husband Brad Wulff. Brad has been very patient and provided lots of help with our children (Zoe and Megan) as I tried the balancing act of wife, mother, full-time Academic and research student.
I appreciate his good humoured support and acknowledge that none of this would be possible without him.

Thank You Carolyn, Liwan, John, Sue, and Brad.

PUBLICATIONS RELATED TO DISSERTATION

Heath, J., McGregor, C., & Smoleniec, J. (2005) DataBabes: A Case Study in Fetal-maternal Clinical Data Mining, Health Informatics Society of Australia General Conference, Melbourne, August 2005, CD ROM, 6 pages.

Smoleniec, J., Heath, S., Heath, J., & McGregor, C. (2005) DataBabes: A Case Study in Fetal-maternal Clinical Data Mining, Poster, Perinatal Society of Australia and New Zealand (PSANZ), Sydney, March 2005.

Heath, J., Heath, S., McGregor, C., & Smoleniec, J. (2004) DataBabes: A Case Study in Data Warehousing and Mining Perinatal Data, CASEMIX Conference, Sydney, October 2004.

Heath, J., & McGregor, C. (2004) Research Issues in Intelligent Decision Support, UWS College of Science Technology & Environment Innovation Conference, Sydney, June 2004.

Smoleniec, J., Heath, S., Heath, J., & McGregor, C. (2004) Fetal-maternal Data Warehouse and Data Mining, Poster, Perinatal Society of Australia and New Zealand (PSANZ), Sydney, March 2004.

ABSTRACT

Existing patient medical records are a rich data source with a potential to support clinical research. Fragmentation of data across disparate medical databases inhibits the use of these existing datasets. Overcoming such disjointedness is possible through the use of a data warehouse. Once the data is cleansed, transformed and stored within the data warehouse, it is possible to turn attention to the exploration of the medical datasets. Exploratory and confirmatory Data Mining tools are well suited to such activities. Traditionally, medical research has been conducted in accordance with the scientific method. Informal discussions with medical practitioners exposed a lack of confidence in data mining activities as they are perceived to not support the scientific method.
This thesis demonstrates that there are strong parallels between the scientific method and the Cross-Industry Standard Process – Data Mining (CRISP-DM). Extensions to CRISP-DM, as proposed in this thesis, can be provided to strengthen these parallels. Establishing a clinical trial to investigate conditions such as lung cancer is relatively straightforward, given the large number of potential patients, when compared to the rare, complex conditions found in the fetal-maternal domain. The extended CRISP-DM enables the use of existing patient data and sophisticated data mining techniques to generate potential ‘knowledge’. The knowledge rules generated can, following clinician review, be used to populate Knowledge Base Architectures in Intelligent Decision Support Systems, thus helping to overcome the labour intensive elicitation of domain knowledge that hinders the establishment of IDSS in the medical domain. This thesis is concerned with: demonstrating parallels between the scientific method and CRISP-DM; extending CRISP-DM for use with medical datasets; and proposing the supporting Intelligent Decision Support System framework. This research has been undertaken using a fetal-maternal case study.

CHAPTER 1 INTRODUCTION

1.1 INTRODUCTION

The use of data mining operations across patient medical data is gaining interest in the research community. Little research has focussed on improvements to data mining methodologies specifically targeting the medical domain. This thesis proposes extensions to the Cross Industry Standard Process – Data Mining (CRISP-DM) methodology specifically to accommodate the demands of exploratory and confirmatory data mining across existing patient data. The enhancements are made in the modelling and evaluation phases of CRISP-DM specifically to support the null hypothesis medical research paradigm.
Despite the clear demands of evidence based medicine, little research has focussed on meeting the requirements of null hypothesis driven data mining. This thesis presents the parallelism that exists between the extended CRISP-DM and the scientific method.

Efficient elicitation of medical domain knowledge for inclusion in Intelligent Decision Support System (IDSS) knowledge bases remains an open research area. Using the extended CRISP-DM in knowledge discovery assists in the automated extraction of domain knowledge from existing patient data. This illustrates the immediate application of the proposed methodology to further the research surrounding IDSS domain knowledge capture. When a rule generating data mining technique is chosen in the modelling phase of CRISP-DM, the knowledge generated is in an electronic format and can be used to populate the knowledge component in an IDSS. The results of the Australian Government sponsored Electronic Decision Support for Australia’s Health Sector study (Australian Health Information Council, 2002) indicate that Australian medical clinicians are reluctant to adopt electronic decision support systems, partly due to concern regarding the content of knowledge bases.

This research focuses on the following open research areas:

1. Existing investigative methods used when data mining across patient medical data are inadequate for the demands of clinical practice and research. The null hypothesis driven medical research paradigm must inform data mining investigative methods in the medical domain.

2. In the medical domain, improvement is required in the elicitation of domain knowledge for use within knowledge bases in IDSS.

3. The exploitation of IDSS in the medical domain, particularly in the Australian context, has been slow and clinicians have concerns regarding the content of knowledge bases found in IDSS.

Aspects of this research are explored using a fetal-maternal case study.
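The idea of null hypothesis driven data mining feeding a knowledge base can be illustrated with a minimal sketch. It assumes that a candidate rule has already emerged from an exploratory step on a separate subset of the data, and it uses a one-sided Fisher's exact test as the confirmatory step. The choice of test, the field names (`factor_a`, `outcome_x`), the 0.05 threshold and the record counts are all assumptions made for illustration; they are not taken from this thesis or from the fetal-maternal case study data.

```python
from math import comb


def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the null hypothesis that the factor and the
    outcome are independent (hypergeometric distribution, fixed margins).
    """
    row1 = a + b          # records with the factor present
    col1 = a + c          # records with the outcome present
    n = a + b + c + d     # all records
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


def contingency(records, factor, outcome):
    """Count the 2x2 table (a, b, c, d) for two boolean fields."""
    a = sum(1 for r in records if r[factor] and r[outcome])
    b = sum(1 for r in records if r[factor] and not r[outcome])
    c = sum(1 for r in records if not r[factor] and r[outcome])
    d = sum(1 for r in records if not r[factor] and not r[outcome])
    return a, b, c, d


def confirm_rule(records, factor, outcome, alpha=0.05):
    """Confirmatory step: test H0 'factor and outcome are independent'.

    A rule is only marked for clinician review when H0 is rejected.
    """
    p_value = fisher_exact_one_sided(*contingency(records, factor, outcome))
    rejected = p_value < alpha
    return {
        "rule": f"IF {factor} THEN {outcome}",
        "p_value": p_value,
        "null_rejected": rejected,
        "status": "pending clinician review" if rejected else "not supported",
    }


def make_records(a, b, c, d, factor="factor_a", outcome="outcome_x"):
    """Build synthetic records matching a 2x2 table (illustrative only)."""
    rows = ([(True, True)] * a + [(True, False)] * b
            + [(False, True)] * c + [(False, False)] * d)
    return [{factor: f, outcome: o} for f, o in rows]


# Synthetic hold-out data, invented purely for this example.
confirmation_set = make_records(9, 1, 2, 8)
result = confirm_rule(confirmation_set, "factor_a", "outcome_x")
print(result["rule"], result["null_rejected"], result["status"])
```

Keeping the confirmatory data separate from the data used to find the candidate rule is what lets the test be read as a hypothesis test rather than a description of the pattern that suggested it. The returned dictionary stands in for an electronic rule that would still require clinician review before entering a knowledge base.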
1.2 RESEARCH AIMS AND OBJECTIVES

The hypotheses raised in support of the research rationale described below are:

Research Hypothesis 1: The Cross Industry Standard Process – Data Mining (CRISP-DM) can be extended to enable its use in medical research driven by the null hypothesis paradigm.

Research Hypothesis 2: An Intelligent Decision Support System (IDSS) can be defined for clinical practice and research, including a data management component to exploit the extended CRISP-DM methodology.

1.3 RESEARCH RATIONALE

1.3.1 Knowledge Discovery in Data (KDD) Approaches for Clinical Research

Within the medical domain the data mining focus has been on hospital organisational issues, such as efficient financial management, rather than patient data research (Alexandrini, Krechel, Maximini & von Wangenheim, 2003; Berndt, Fisher, Hevner & Studnicki, 2001; Ewen, Medsker, Dusterhoft, Levan-Schultz, Smith & Gottschall, 1999; Lyman, Boyd & Dalton, 2003; Raghupathi, Winiwarter, Werner, & Tan, 2002; Rao, Niculescu, Germond & Rao 2003; Rindfleisch, 1997; Schubart & Einbinder, 2000; Zaidi, Abidi & Manickam, 2002). Knowledge Discovery in Data researchers have documented the complex nature of medical data and describe the unique difficulties encountered when working with such data (Goodwin, Mahler, Ochno-Machado, Iannacchione, Crockett, Dreiseitls, Vinterbo & Hammond, 2000; Goodwin & Grzymala-Busse, 2001; Goodwin, Iannacchione, Hammond, Crockett, Mahler, & Schlitz 2001; Jung & Gudivada, 1995; Lee & Abbott, 2003; Masuda & Sakamoto, 2002; Podgorelec, Kokol & Stiglic, 2002; Roddick, Fule & Graco, 2003). As early as 1998 researchers such as Brossette (Brossette, Sprague, Hardin, Waites, Jones & Moser, 1998) were attempting to use data mining techniques to contribute to medical knowledge. However, the application of data mining to historical patient medical data has mostly been conducted in isolated research work (Goodwin et al., 2001; Goodwin et al., 2001; Jung & Gudivada, 1995; Kovalerchuk, Vityaev & Ruiz, 2000; Masuda et al., 2002; Matsumoto, Ueda & Kawaji 2002; Povalej, Lenic, Zorman, Kokol, Peterson & Lane, 2003; Roddick et al., 2003; Tsymbal & Aronson, 2003; Wong, Lam, Leung, Ngan & Cheng, 2000) and has not made a transition to being a widely accepted technique to inform clinical practice and research.

The null hypothesis driver of the Scientific Method, as used in medical research, calls for a modification to the approach used for data mining across medical data, and yet Roddick et al. (2003) report that little research has been directed towards formalisation of such a revised approach. Neither medical research nor computing research has focussed on this open research area. This thesis explains why such a change is needed and puts forward a KDD methodology that extends the widely accepted CRISP-DM. The extensions include the use of exploratory and confirmatory data mining and support the null hypothesis paradigm, hence strengthening the value of the Data Management components of IDSS specifically to support medical domains.

1.3.2 Limitations of Existing OnLine Transaction Processing Systems for the Fetal-maternal Domain

This research was motivated by a review of patient data analysis conducted in association with Associate Professor Smoleniec, Director, Feto-Maternal Unit, Liverpool Hospital, South Western Sydney Area Health Service (SWSAHS). It emerged that the existing on-line transaction processing system used to assist in clinical practice was inadequate for clinical research, particularly multi-dimensional data analysis. A data warehouse combined with other elements of DSS would overcome some of the problems encountered at the fetal-maternal unit.
The need to bring disparate information systems together to leverage knowledge discovery has been widely recognised and pursued by both the research and commercial communities (Berndt et al., 2001; Bonifati, Cattaneo, Ceri, Fuggetta & Paraboschi, 2001; Devlin, 1997; Ewen et al., 1999; Inmon, 2002; Mallach, 2000; Marakas, 2002b). Associate Professor Smoleniec’s aim was to leverage existing patient treatment data to assist in the practice of evidence-based medicine and to inform clinical research.

Comments from Associate Professor Smoleniec and other clinicians raised my awareness of the questions that surround the use of KDD methodologies and tools in the generation of medical evidence. There was a reluctance to ‘accept’ KDD results due to a perceived lack of rigour in data mining activities. Associate Professor Smoleniec’s concerns regarding statistical significance and data bias directed my work towards an improved methodology, closer to the respected Scientific Method.

1.3.3 Slow Uptake of IDSS in the Fetal-maternal Medical Domain

Little research has tackled the complex issues surrounding the establishment of a data warehouse and IDSS architecture in the fetal-maternal domain. Goodwin et al. (2000; 2001) directed their efforts to the demanding, emotive, challenging field of fetal-maternal health. Establishment of such IDSS architectures has been rapid in areas driven by the commercial desire to show a ‘profit’, including marketing and share trading. In these domains it was perceived that the architectures could assist in discovering competitive advantage type knowledge from disparate information systems.
More recently, counter terrorism activities have explored IDSS and associated data warehousing and data mining functions (Popp, Armour, Senator & Numryk, 2004; Zdanowicz, 2004), and innovative Australian researchers have employed data mining techniques to assist in missing person profiling (Blackmore & Bossomaier, 2002), cotton growth (Johnson, 2004) and farm management (Robinson, 2005). Despite the successes in other industries, the health sector has been relatively slow in adopting IDSS technologies. One of the inhibitors is the difficulty encountered when eliciting domain knowledge for use in knowledge bases within IDSS.

1.4 RESEARCH APPROACH

This research is non-empirical in nature, focussing on ideas and frameworks as described by Alavi and Carlson (1992). Continuing with the classifications presented by Alavi and Carlson (1992), this non-empirical research falls into the applied sub category with emphasis on conceptual and illustrative elements, thus covering the “why” and “how” of a framework for an IDSS, utilizing the null hypothesis medical research paradigm, for Clinical Practice and Research.

An action research approach has been taken for this research. Galliers’ (1993) earlier work indicates that such an approach is suitable for research aimed at methodology development. The approach also suits the multidisciplinary nature of the research undertaken, which focuses on core computer science knowledge as extended and applied in a complex fetal-maternal medicine environment. As with most action research projects, the aim here has been to improve practice through collaborative work between researchers and practitioners, with interventions at the research site to ‘test’ whether the prototype and associated analysis methodologies were feasible.

Awareness of the dangers inherent in this joint endeavour research approach ensured that I did not slip into merely a ‘consultancy’ role with the fetal-maternal collaborators.
These dangers, as described in earlier work (Avison, 2002; Susman & Evered, 1978), were overcome during the conduct of this research by:

• Careful negotiation at the beginning of the research to ensure that there was an agreed set of aims from both the computer science and medical participants. Presentations related to this research have appeared at both professional medical conferences (Heath, Heath, McGregor & Smoleniec, 2004) and health informatics and computer science conferences (Heath & McGregor, 2004; Heath, McGregor & Smoleniec, 2005), which indicates the successful achievement of the common goals across both disciplines. The importance of this negotiation is emphasised by Kock, McQueen and Baker (1996) in their work addressing the ‘initiative dilemma’ associated with action research.

• Clear statement of research aim, theory and method presented at the outset of this research work. This was necessary to gain ethics approval from both institutions, South Western Sydney Area Health Service (SWSAHS) and the University of Western Sydney (UWS). These applications were both successful and supported by SWSAHS and UWS.

• Adherence to the action research process presented by Susman and Evered (1978), see Figure 1.1 below.

Figure 1.1: Susman and Evered’s Action Research Cycle. (Susman & Evered, 1978)

• Inclusion of the ‘Specifying Learning’ phase ensures that rigour is included in this research approach. In addition, this aspect ensures that a contribution to existing knowledge results from the research. Academic and Industry publications and presentations have emerged from this research (Heath et al., 2004; Heath & McGregor, 2004). This is largely due to the contribution of new knowledge that arose from this action research, hence ensuring this was not ‘just a one-off consultancy’, but rather a genuine contribution to knowledge.
• The work of Avison, Lau and Myers (1999) reflects the use of Action Research for this fetal-maternal IDSS research, particularly regarding a ‘mutually acceptable ethical framework’, an essential requirement for Action Research participants to minimise conflict during the research process.

An interesting reflection on this research thesis is to consider the qualitative Action Research framework guiding the conduct of this research. Contained within this Action Research framework is acknowledgement of the strength of the null hypothesis, quantitative research paradigm and exploration of the potential for knowledge discovery in data to enhance such quantitative research. An exploration of the merits of qualitative versus quantitative research is beyond the scope of this thesis; however, this thesis is unusual in that both paradigms are well regarded and embraced.

The double challenge of meeting both (1) the needs of action in an organisation and (2) quality research made this research project more difficult than a carefully constructed survey or experimental research involving a set of data specifically collected for research purposes. Avison et al. (1999) and Kock et al. (1996) considered the need to serve two masters: the immediate research clients and the academic community in general. Action research is not an efficient research method, as substantial time is spent preparing client data, and this IDSS research in the fetal-maternal domain had to wrestle with data quality and management issues throughout the research.

Panel discussions at the 2004 8th Pacific Asia Knowledge Discovery in Data (KDD) conference in Sydney greatly influenced my decision to adopt an action research paradigm.
Eminent persons in the KDD field including Han (1995; 1996; 1998; 2002; Han et al., 1997; Han & Kamber, 2001; Han & Pei, 2000; Han et al., 2000; Wang, Fan, Yu & Han, 2003) and Webb (Webb, Han & Fayyad, 2004; Webb, 2001; Webb, Butler & Newlands, 2003; Webb, 2000) led panel discussions at the conference regarding future directions of KDD research. The panel emphasised that future research must tackle the use of 'real world' data rather than carefully constructed test data sets that confirm the relative efficiency of KDD algorithms. This is a challenge that I have undertaken in this thesis and associated research work, which necessitated the use of an action research approach.

1.5 CONTRIBUTIONS TO KNOWLEDGE

This research contributes four key elements to knowledge, specifically:

(1) Extensions to the CRISP-DM to facilitate its use in clinical practice and medical research applications.

(2) Recognition of the parallelism between CRISP-DM and the Scientific Method and the importance of the role played by both exploratory and confirmatory data mining.

(3) A proposed framework for IDSS in a fetal-maternal domain.

(4) Enhancements to the Data Management component within IDSS through exploiting (1) and (2) to generate domain knowledge for use in the Knowledge Base component of IDSS.

1.6 THESIS OVERVIEW

Chapter 1 begins with a consideration of the drivers behind this research. These factors come from both the medical domain, specifically fetal-maternal, and computer science. The research approach used throughout the conduct of this cross-disciplinary research is also presented in Chapter 1. Chapter 1 concludes with a concise summary of the contribution to knowledge made by this thesis and surrounding research activities.

Chapter 2 contains a summary of some of the literature reviewed and considered in the conduct of this research. Initially the literature review conducted was quite broad.
To keep the focus of this thesis on the stated hypotheses, many sources that were uncovered during this research but do not directly impact on this thesis have been omitted. Research included in the literature review ranges widely, from the works of the Electronic Decision Support in Australia committee to the British Medical Journal and mainstays of Computing research such as ACM SIGKDD, SIGMOD, IEEE and foundation DSS / KDD researchers such as Inmon, Aronson, Han and Kimball.

Chapter 3 presents the extended CRISP-DM, building on the initial data mining concepts and open research questions described in the literature review. The parallelism between the Scientific Method and the extended CRISP-DM is also explored in this chapter.

The Intelligent Decision Support System (IDSS) framework I propose is presented and discussed in Chapter 4.

The ‘DataBabes’ case study is presented in Chapter 5. This case study includes a brief description of the activities conducted in my research partner’s fetal-maternal unit at Liverpool Hospital. Details of a prototype data mart developed for clinical research on Chorionic Villus Sampling (CVS) are also presented in Chapter 5. This data mart instantiates the extraction, transformation and loading component and the data warehouse component of the IDSS framework proposed in Chapter 4. Early data mining activities conducted across the fetal-maternal data are also described in this chapter.

This thesis concludes with Chapter 6, presenting research conclusions and suggesting future research directions resulting from this Masters (Hons) research.

CHAPTER 2 LITERATURE REVIEW

2.1 INTRODUCTION

This literature review considers the research areas that contribute to the framework for IDSS clinical practice and research.
The scope of the literature review includes:

• Knowledge Discovery in Data (KDD)
• Knowledge Discovery in Data in medical domains
• Clinical practice and clinical research
• Intelligent Decision Support Systems

Particular attention is focussed on the open issues within these research areas, their impact on this thesis and future research challenges.

2.2 KNOWLEDGE DISCOVERY IN DATA (KDD) FRAMEWORKS

2.2.1 General Process of KDD

Since the 1960s database technologies have been evolving from primitive file processing systems to sophisticated and powerful database systems, generating an abundance of data along the way. Human comprehension is insufficient to analyse these vast volumes of data. Society finds itself described as ‘data rich but information poor’. Knowledge discovery in data (KDD) is a concept born from the need to make better use of the vast volumes of stored data; retrospective patient medical data is the particular focus of this research. Han and Kamber (2001) use the following figures to illustrate the ‘data rich, information poor’ situation and to contrast it with the search for knowledge from within the data.

Figure 2.1: ‘We are data rich, but information poor’. (Han & Kamber, 2001)

The goal of KDD is to search for knowledge contained within the data. Han and Kamber (2001) offer Figure 2.2 to depict the reorganisation of data and discovery of knowledge from within the data.

Figure 2.2: ‘Searching for knowledge (interesting patterns) in your data’. (Han & Kamber, 2001)

The abbreviation KDD is used by Han and Kamber (2001) and Mackinnon and Glick (1999) for the phrase ‘Knowledge Discovery in Databases’. Elsewhere the abbreviation is read more broadly as ‘Knowledge Discovery in Data’, which is the preferred term for this research thesis.

Some practitioners and researchers, including Marakas (2002b), suggest that data mining is a synonym for KDD.
Alternatively, others (Han & Kamber, 2001), including myself, view data mining as an essential step in the process of KDD. Han and Kamber (2001) illustrate the iterative knowledge discovery process in the following figure:

Figure 2.3: Data mining as a step in the process of knowledge discovery. (Han & Kamber, 2001)

Other sections of this thesis describe the databases, cleaning and integration, data warehouse, selection and transformation, and evaluation stages of the KDD process. This part of the literature review focuses on data mining aspects. Reviewing this diagram and considering the broader requirements necessary to support data mining, such as data preparation, it is clear that data mining is in fact a step within KDD. Data mining and KDD are not interchangeable terms for the same activity.

The general process of KDD has been researched and published by a wide variety of researchers and research groups, including:

1. Marakas (2002b) provides guidelines for the overall data mining process:
   • select a topic for study
   • identify the target data set(s)
   • clean and pre-process the data
   • build a data model
   • mine the data
   • interpret and refine
   • predict
   • share the model
2. Roiger and Geatz (2003) suggest a very similar KDD process model:
   • goal identification
   • creating a target data set
   • data pre-processing
   • data transformation
   • data mining (a best model for representing the data is created by applying one or more data mining algorithms)
   • interpretation and evaluation
   • taking action
3. The Cross-Industry Standard Process for Data Mining (CRISP-DM) (CRISP-DM, 2004) includes the following domain-independent steps in data mining:
   1. Business Understanding
   2. Data Understanding and Data Preparation
   3. Modelling (actual work of data mining: specific hypotheses are tested or automated discovery methods are run)
   4. Evaluation
   5. Deployment

The general processing described in these three examples provides an iterative framework from which to explore further open issues in data mining and knowledge discovery. It is important to note that data mining is only one step in the general KDD process; however, it has received by far the most research attention.

Each of these examples specifies a data pre-processing or data preparation phase. Sections 2.3.2.1 and 2.3.2.2 of this thesis consider extraction-transformation-loading frameworks and data warehouse architectures. The discussion in those sections is directly relevant to the data preparation stages identified in the above three KDD process examples. In addition, researchers including Han and Kamber (2001) have described real-world data as often incomplete, inconsistent and noisy. Data preprocessing includes:

• data cleaning: routines to fill in missing values, smooth noisy data, identify outliers and correct data inconsistencies
• data integration: routines to combine data from multiple sources
• data transformation: conversion of data into appropriate forms for mining
• data reduction: routines used to reduce the representation of data while minimising the loss of information content. Example techniques include data cube aggregation, dimension reduction and data compression.

Researchers such as Vassiliadis, Simitsis and Skiadopoulos (2002) work towards improving Extraction-Transformation-Loading (ETL) processing to reduce the estimated 80% of data warehouse development time that is commonly committed to data preparation. These researchers state:

Thus, it is apparent that the design, development and deployment of ETL processes, which is currently performed in an ad-hoc, in-house fashion, needs modelling, design and methodological foundations. Unfortunately the research community has a lot of work to do to confront this shortcoming.
(Vassiliadis et al., 2002)

These researchers begin by suggesting a conceptual model, generic in nature, to accommodate ETL processes. This is early research and there remain many unanswered questions and issues for further work.

The presence of missing or incomplete data is commonplace in large real-world datasets. Xintao et al. (2002) state:

Missing values occur for a variety of reasons eg. omissions in the data entry process, confusing questions in the data gathering process, sensor malfunction and so on. (Xintao, Wu, Daniel, & Barbara, 2002)

These researchers go on to present linear algebra and constraint programming techniques to learn the missing values using a priori known summary information and that derived from raw data. Learning missing data is essential for populating data warehouses to enable data mining, and it remains an open research area.

2.2.2 Data Mining

Han and Kamber (2001) offer the following summary of the evolution of database technology leading to the field of data mining (DM).
• 1960s and earlier
   o primitive file processing
• 1970s – early 1980s
   o hierarchical, network and relational databases
   o data modelling tools, eg entity-relationship diagrams
   o indexing and data organisation techniques, eg B+tree, hashing
   o query languages, eg SQL
   o user interfaces, forms and reports
   o query processing and optimisation
   o transaction management, eg recovery, concurrency control
   o On-line transaction processing (OLTP)
• Mid 1980s – present
   o advanced data models, eg extended-relational, object-oriented, object-relational, deductive
   o application-oriented, eg spatial, temporal, multimedia, active knowledge bases
• Late 1980s – present
   o data warehouse and OLAP technology
   o data mining and knowledge discovery
• 1990s – present
   o XML based database systems
   o web mining

Han and Kamber (2001) go on to describe data mining as a step in the knowledge discovery process and provide a definition of the term data mining:

Data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses or other information repositories. (Han & Kamber, 2001)

George Marakas (2002b) also provides a similar, high-level definition of the term data mining:

Data Mining (DM) is the set of activities used to find new, hidden or unexpected patterns in data. (Marakas, 2002b)

A more informative definition of data mining comes from Roiger and Geatz (2003):

Data mining is an induction-based learning strategy that builds models to identify hidden patterns in data. A model created by a data mining algorithm is a conceptual generalisation of the data.
The generalisation may be in the form of a tree, a network, an equation or a set of rules. (Roiger & Geatz, 2003)

Data mining incorporates techniques from many disciplines including: statistics, machine learning (Zorman, Kokol, Lenic, Povalej, Stiglic & Flisar, 2003), high-performance computing, pattern recognition, neural networks, data visualisation, image and signal processing and spatial data analysis. Cabena, Hadjinian, Stadler, Verhees and Zanasi (1997) present four main data mining operations and commonly used techniques:

   Data Mining Operation    Data Mining Techniques
1  Predictive modelling     Classification; Value prediction
2  Database segmentation    Demographic clustering; Neural clustering
3  Link analysis            Association discovery; Sequential pattern discovery; Similar time sequence discovery
4  Deviation detection      Statistics; Visualization

Table 2.1: Data mining operations and techniques (Cabena et al., 1997)

Data mining can be conducted on a variety of different information repositories including, but not limited to: relational databases (Goodwin et al., 2000; Goodwin et al., 2001; Mackinnon & Glick, 1999); data warehouses; transactional databases; object-oriented databases; object-relational databases; spatial databases; temporal databases; text databases; multimedia databases; the World Wide Web; and streaming data (Babcock, Babu, Datar, Motwani & Widom, 2002; Qiao, Agrawal & Abbadi, 2003).

A broad set of discipline areas are utilized when conducting data mining, as illustrated by Han and Kamber (2001) in the following diagram:

Figure 2.4: Fields contributing to Data Mining: Database Technology, Statistics, Information Science, Visualisation, Machine Learning and Other Disciplines. (Han & Kamber, 2001)

These researchers suggest five primitives for specifying a data mining task:

1. specification of data set to be mined. Users specify the database and tables or data warehouse and data cubes containing data to be mined.
Conditions for selecting and grouping, and the attributes (or dimensions) to be analysed when mining, are also specified.
2. kind of knowledge to be mined, eg characterization, discrimination, association, classification or prediction
3. background knowledge, often in the form of concept hierarchies. Concept hierarchies express discovered patterns in concise, high-level terms and at differing levels of abstraction. For example the following diagram, from Han and Kamber (2001), illustrates a concept hierarchy for attribute age:

   Level 0: all
   Level 1: young | middle-aged | senior
   Level 2: 20…39 | 40…59 | 60…89

   Concept hierarchy for attribute age. (Han & Kamber, 2001)

   Techniques such as binning, histogram analysis, cluster analysis and segmentation by natural partitioning may be used for automatic generation of concept hierarchies.
4. Interestingness measures
5. Knowledge presentation and visualisation techniques, which are used to display the discovered patterns

This literature review continues with a consideration of exploratory/descriptive versus confirmatory/predictive data mining. The distinction between the (1) motivations and (2) operations of these two broad categories of data mining is fundamental to the innovative methodology presented in this thesis. Chapter 3 considers the application of these two data mining approaches when researching on patient medical data within a Scientific Method framework.

Data mining systems can be classified according to the kinds of databases mined, kinds of knowledge mined, data mining techniques used or the type of applications adapted (Han & Kamber, 2001). In addition, data mining can be classified as either ‘exploratory/descriptive’ or ‘confirmatory/predictive’.

Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.
(Han & Kamber, 2001)

2.2.3 Exploratory/Descriptive Data Mining

Exploratory or descriptive data mining assists an analyst in exploring the data in search of interesting patterns. The analyst uses domain knowledge, skill and intuition to guide the exploration. This exploration is quite subjective and, because it relies on post-hoc analyses, lacks statistical rigour. Any resulting statistical inferences, such as p-values, are a guide only.

Association rule mining involves discovering rules that show attribute-value conditions occurring frequently together in a given set of data (Han & Kamber, 2001). These rules provide a starting point for data exploration and they are a popular tool in exploratory data mining.

Association rules are of the form X ⇒ Y, that is "A1 ∧ … ∧ Am ⇒ B1 ∧ … ∧ Bn", where Ai (for i ∈ {1,…,m}) and Bj (for j ∈ {1,…,n}) are attribute-value pairs. The association rule X ⇒ Y is interpreted as "database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y".

Example (Han & Kamber, 2001):

contains(T, "computer") ⇒ contains(T, "software") [support=1%, confidence=50%]

meaning that if a transaction, T, contains "computer", there is a 50% chance that it contains "software" as well, and 1% of all transactions contain both.

Only a single repeating attribute or predicate, contains, is used in this example. Association rules that contain a single predicate are called single-dimensional association rules. The above rule can also be written as:

"computer ⇒ software [1%, 50%]"

Any association rules arising from exploratory data mining cannot be used for prediction without careful analysis and consideration by domain experts. The research reported by Ohsaki (Ohsaki, Sato, Kitaguchi, Yokoi & Yamaguchi, 2004, 2005) considers automated approaches to determining the value and ‘interestingness’ of association rules generated across medical datasets. Association rules do not necessarily indicate causation.
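The support and confidence figures attached to a rule such as "computer ⇒ software [1%, 50%]" can be obtained by simple counting over the transactions. The following is a minimal Python sketch, not code from Han and Kamber (2001): the function name support_confidence and the toy transaction list are assumptions made for illustration, and the list is constructed so that it reproduces the 1% and 50% figures of the example.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = fraction of all transactions containing both sides of the rule
    confidence = fraction of the transactions containing the antecedent
                 that also contain the consequent
    """
    x = frozenset(antecedent)
    y = frozenset(consequent)
    n_total = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)          # transactions satisfying X
    n_xy = sum(1 for t in transactions if (x | y) <= t)   # transactions satisfying X and Y
    support = n_xy / n_total if n_total else 0.0
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical data: 100 transactions, two contain "computer",
# and one of those two also contains "software".
transactions = (
    [{"computer", "software"}]
    + [{"computer", "printer"}]
    + [{"paper"} for _ in range(98)]
)

support, confidence = support_confidence(transactions, {"computer"}, {"software"})
print(f"support={support:.0%}, confidence={confidence:.0%}")  # support=1%, confidence=50%
```

Both measures only describe how often items occur together; neither says anything about why they do, which is the reason such rules cannot be read as causal.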
Clearly this warning must be heeded when data mining across medical data. This is an important fundamental concept that underlies the extensions to CRISP-DM described in Chapter 3 of this thesis.

An interesting association rule mining research project is that of Becquet, Jeudy, Boulicaut and Gandrillon (2002), who applied association rule mining to gene-expression data analysis and found the results to be complementary to existing gene-expression clustering techniques. Association rules and data mining with a focus on hospital infection control were researched and reported in 1998 in the Journal of the American Medical Informatics Association (Brossette et al., 1998).

2.2.4 Confirmatory/Predictive Data Mining

Confirmatory data mining is objective, with hypotheses and analyses planned a priori, so the resulting p-values are more meaningful than those generated via exploratory data mining. A hypothesis or model formulated in exploratory data mining is tested using confirmatory data mining techniques. Confirmatory or predictive data mining is used to indicate an expected result based on facts contained within the mined data source. Clearly the output of association rule mining can be included in the predictive data mining category, but the warning regarding causation in section 2.2.3 above must be heeded, particularly in a medical domain.

2.2.5 Classification and Prediction

Han and Kamber (2001) introduce classification:

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose label is unknown. (Han & Kamber, 2001)

Classification and prediction data mining generates a derived model based on a set of objects whose class label is known, called training data. The class label of data objects can be predicted using classification.
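To make concrete how a model is derived from training data and then used to predict an unknown class label, the sketch below learns a set of classification (IF-THEN) rules with the simple 1R ("one rule") method. 1R is chosen here only for brevity and is not a technique discussed in the sources cited above; the function names, attribute names, values and labels are all invented for illustration.

```python
from collections import Counter, defaultdict


def train_one_rule(rows, labels):
    """Learn a 1R model: choose the single attribute whose
    'value -> majority class' rules make the fewest training errors."""
    best = None
    for attr in rows[0]:
        # Count class labels for each value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # One IF-THEN rule per attribute value: predict the majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rules[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    _, attr, rules = best
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    return attr, rules, default


def predict(model, row):
    """Predict the class label of one object using the learned rules."""
    attr, rules, default = model
    return rules.get(row[attr], default)


# Hypothetical training data: objects whose class label is known.
rows = [
    {"age_group": "young", "student": "no"},
    {"age_group": "young", "student": "yes"},
    {"age_group": "middle-aged", "student": "no"},
    {"age_group": "senior", "student": "yes"},
    {"age_group": "senior", "student": "no"},
    {"age_group": "middle-aged", "student": "yes"},
]
labels = ["no", "yes", "yes", "yes", "no", "yes"]

model = train_one_rule(rows, labels)
print(model[0], model[1])
print(predict(model, {"age_group": "young", "student": "yes"}))
```

On this toy data the attribute student misclassifies one training object while age_group misclassifies two, so the learned model consists of the rules "IF student = yes THEN yes" and "IF student = no THEN no".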
The data-mining-derived models may take many forms including:

o decision trees: a flow-chart-like tree structure in which each node denotes a test on an attribute value, each branch represents an outcome of the test and the tree leaves represent classes or class distributions
o classification (IF-THEN) rules: decision trees are easily converted into classification rules
o mathematical formulae
o neural networks: a collection of neuron-like processing units with weighted connections between units.

Prediction is the term used when the actual data values are to be predicted rather than just the class labels. Prediction can also include the identification of distribution trends based on available historical data.

Relevance analysis is an activity that often precedes classification and prediction. There are attributes found within source data that do not contribute to the classification or prediction process. Relevance analysis involves identifying these attributes and omitting them from the classification and prediction process.

2.2.6 CRISP-DM

The Cross Industry Standard Process – Data Mining (CRISP-DM) is a process model that includes the CRISP-DM methodology, reference model and user guide. This process model was developed by a group of industry-based data miners to be non-proprietary and freely available, and was initially funded by the European Commission. CRISP-DM has, as its initiators hoped, been broadly accepted as a sound foundation for data mining activities (Cabena et al., 1997; Roddick et al., 2003).

CRISP-DM has six general phases as illustrated in the following diagram:

Figure 2.6: CRISP-DM phases (CRISP-DM, 2004)

The CRISP-DM methodology is described as a hierarchical process model, consisting of sets of tasks described at four levels of abstraction (from general to specific): phase, generic task, specialised task and process instance.
(CRISP-DM, 2004)

At the highest level the six phases are:

(1) business understanding
(2) data understanding
(3) data preparation
(4) modelling
(5) evaluation
(6) deployment

Each of these phases has associated with it a set of sub-tasks that are spread across the lower levels of the hierarchical model, as illustrated in the diagram below, moving from generic to more specialised tasks.

Figure 2.7: Four-level breakdown of the CRISP-DM methodology, mapping the CRISP Process Model to the CRISP Process (CRISP-DM, 2004)

Generic tasks are those that must be completed for all data mining situations, such as cleaning and reformatting data. Such tasks are completed for all data mining activities including retail, financial, anti-terrorism and medical scenarios. The third level, specialised tasks, describes in more detail the particular requirements, such as cleaning numeric, text or streaming data. The lowest level, the process instance, holds a record of the execution and results of a particular data mining instance.

2.2.7 Critics

Data mining activities are viewed with scepticism by some researchers. This issue has been raised in many publications including the work of Mackinnon and Glick (1999), who warn of situations where statisticians have used the term data mining to denote unsavoury ‘data dredging’ or ‘fishing expeditions’ in search of publishable p-values. The identification and acknowledgement of such concerns moved my research in a direction that sought an acceptable framework within which to conduct data mining activities in a medical domain. The result is the improved CRISP-DM methodology that acknowledges the rigour of the Scientific Method and places data mining activities within this rigour, as detailed in Chapter 3.

Han and Kamber (2001) also warn that there are many 'data mining systems' available in the commercial market; however, many cannot perform the advanced activities needed for genuine data mining.
These researchers also warn against the inappropriate use of all-purpose data mining systems:

Different applications often require the integration of application-specific methods. Therefore, a generic, all-purpose data mining system may not fit domain-specific mining tasks. (Han & Kamber, 2001)

Marakas (2002b) summarizes the limitations of data mining as:

Identification of missing information. Future systems need to include mechanisms for 'inventorying' the dataset to determine sufficiency of attributes for DM.

Data noise and missing values. Noise is the difference between a model and its predictions. Data is referred to as ‘noisy’ when it contains missing or incorrect values or extraneous columns (Connolly & Begg, 2005). DM systems use statistical techniques to deal with noise. These techniques rely on known distributions of data noise. Future systems must incorporate more sophisticated mechanisms for treating missing or noisy data.

These two limitations represent ongoing open research areas.

2.2.8 Data Mining in a Medical Domain

Data mining applications such as market basket analysis, missing persons analysis, sales forecasting, and customer management and retention exist in a very different domain to that of fetal-maternal and broader medical environments. Later sections of this thesis consider, in detail, the implications of conducting data mining in the null-hypothesis-driven medical domain. Other issues also require special consideration when conducting data mining activities in the medical domain, including but not limited to: ethics, data availability, data quality and the demands of evidence-based medicine.

Recent KDD research in the medical domain has exposed the importance of close collaboration between knowledge engineers and clinicians due to large, complex, heterogeneous, hierarchical, time-varying and quality-varying datasets (Goodwin et al., 2001; Roddick et al., 2003).
The following issues emerge from current research as key matters for consideration when data mining in the medical domain, as described by Roddick et al. (2003):

o Investigative Method
o Rule Interpretation
o Working with a Considerable Knowledge Base
o Data Availability and
o Accuracy and Ethical Safeguards

Finding new patterns in medical data is a driving force behind KDD in the medical domain (Goodwin et al., 2001; Podgorelec et al., 2002; Roddick et al., 2003). Innovative algorithms are constantly sought in KDD; however, there is an emerging view amongst some DM/KDD researchers that future research must focus on the use of the previously described 'real world' data rather than carefully constructed test data sets that confirm the relative efficiency of algorithms (Webb, Han & Fayyad, 2004).

2.2.9 Conclusions and Implications for this Research

When applying the CRISP-DM methodology in the fetal-maternal domain, and the broader medical domain, an over-arching research process or investigative method must drive the activities in the lower three layers of the CRISP-DM hierarchy. This is an important, unaddressed open research question. Any data mining carried out on medical datasets will be of restricted value in the medical field if it has not been produced using a rigorous, scientific-method-driven approach. This conclusion, and the implications it has for future data mining in the medical domain, is one of the strongest influences on the concepts developed for this research thesis. The research of Roddick et al. (2003) has particularly influenced my research.

2.3 INTELLIGENT DECISION SUPPORT SYSTEMS (IDSS)

This section of the literature review begins with a brief introduction to the components of a generic Decision Support System (DSS). Open research issues are highlighted, with particular focus on issues that may impact the medical domain, and the impact on this research is considered.
2.3.1 Evolution of Intelligent Decision Support Systems

Decision making has been the focus of substantial research, as documented in texts and papers including those of Anthony (1965) and Little (1970). Marakas (2002a) states that the Decision Support System (DSS) concept arose from the problems faced daily by organisational decision makers. In the 1970s two influential papers, written by J.D. Little (1970) and Gorry and Scott Morton (1971), provided the genesis of modern DSS. Little recognised that managers needed a

Model-based set of procedures for processing data and judgements to assist a manager in his decision making. (Little, 1970)

Gorry and Scott Morton (1971) introduced the term decision support system and developed a two-dimensional framework for computer-based support of management decision making. The framework consists of continuous dimensions, with classification of decision structure on the vertical dimension and levels of managerial activity on the horizontal dimension. This framework extended the previous classification of decision structure, proposed by Simon (1960), and the managerial activity levels of Anthony (1965). Decisions made within the fetal-maternal domain, and the broader medical domain, largely fall into the semi-structured category which, according to Gorry and Scott Morton (1971), benefits from DSS support.

Marakas (2002a) defines DSS in the following way:

A decision support system is a system under the control of one or more decision makers that assists in the activity of decision making by providing an organized set of tools intended to impose structure on portions of the decision-making situation and to improve the ultimate effectiveness of the decision outcome (Marakas, 2002a).

Other researchers such as Mallach (2000) and Turban (Turban & Aronson, 2001) have provided various definitions for DSS which generally abide by similar themes to those incorporated in Marakas’s definition above.
Marakas (2002a) presents a table to summarise the common types of support provided by DSS; numbers have been added to Marakas’ list to aid future reference in this literature review.

    Common type of support provided by DSS
1   Explores multiple perspectives of a decision context
2   Generates multiple and higher quality alternatives for consideration
3   Explores and tests multiple problem-solving strategies
4   Facilitates brain storming and other creative problem solving techniques
5   Explores multiple analysis scenarios for a given decision context
6   Provides guidance and reduction of debilitating biases and inappropriate heuristics
7   Increases decision makers ability to tackle complex problems
8   Improves response time of decision maker
9   Discourages premature decision making and alternative selection
10  Provides control over multiple and disparate sources of data

Table 2.2: Common types of DSS support.

Marakas (2002a) presents some characteristics common to most DSS applications:

o Employed in semi-structured or unstructured decision contexts
o Intended to support decision makers rather than replace them
o Supports all phases of the decision making process
o Focuses on the effectiveness of the decision-making process rather than its efficiency
o Is under control of the DSS user
o Uses underlying data and models
o Facilitates learning on the part of the decision maker
o Is interactive and user friendly
o Is generally developed using an evolutionary, iterative process
o Provides support for all levels of management from top executives to line managers
o Can provide support for multiple independent or interdependent decisions
o Provides support for individual, group and team-based decision making contexts

As the concept of a DSS evolved from its inception in the 1970s to the present day, numerous DSS variations have emerged including knowledge-based systems, artificial intelligence, expert systems, data visualisation systems, executive information systems and group
support systems.

Marakas (2002a) generally classifies components of a DSS into five distinct parts:

1. The data management system
2. The model management system
3. The knowledge engine
4. The user interface
5. The user(s)

The data management component stores, retrieves and organises the DSS data. Several subsystems make up this data management component including the physical database(s), database management system and query facility. In addition, DSS security functions, data integrity and data administration procedures are provided by the data management component.

The quantitative models and analytical capabilities of the DSS are provided by the model management system. The model base, model base management system, model execution and synthesis processors make up the model management system. The model base in a DSS is the modelling counterpart to the database. Decision models offer a simplified representation of reality and may be broadly classified as abstract or conceptual decision models. Abstract models include deterministic, stochastic, simulation and domain-specific models. Deterministic models ensure that no variable can take more than one value at any given time; thus the same output values will result from a given set of input variables. Stochastic models have at least one uncertain variable, which must be described by some probability function. Simulation models allow the testing of various outcomes; by comparing the results, the decision maker (DSS user) can determine the most desirable alternative.

The knowledge engine provides the “brains” of the DSS (Marakas, 2002a). The knowledge base houses the domain-specific knowledge including the rules, heuristics, boundaries, constraints, previous outcomes and other information necessary for the problem domain.
Elicitation of domain knowledge for use in knowledge bases is an open research area, and this thesis presents a methodology for using existing historical patient records to generate fetal-maternal domain ‘rules’.

Continuing with Marakas’ “brains” and “brawn” metaphor for an expert system, the inference engine provides the “brawn”. The inference engine (IE) puts the knowledge to work to produce solutions. The IE operates on a three-phase control cycle: (1) match rules with given facts; (2) select the rule to be executed; (3) execute the rule by adding the deduced fact to the working memory. Operation of the IE is driven by the deductive inference rule known as modus ponens, which states “if A is true and A implies B is true, then B is true.” The counterpart rule, modus tollens, states “if A implies B is true, and B is false, then we can conclude that A is also false”.

The user interface provides the means by which the DSS user works with the data, model and processing components of the system. The users are an essential component of the DSS:

Without considering the user as part of the system we are left with a set of computer based components that, by themselves, provide no useful function at all. (Marakas, 2002a)

Recent Decision Support System research is focussing on the use of DSS in a wide variety of problem domains, including Australian farming systems (Robinson, 2005), the cotton industry (Johnson, 2004) and Zeleznikow and Nolan’s (2001) generic work in uncertain domains.

2.3.2 IDSS Components

A typical IDSS architecture, as presented by Turban and Aronson (2001), is illustrated in Figure 2.8 below. The inclusion of the optional knowledge management and model management components adds the ‘intelligence’ factor to the definition of DSS provided in earlier sections of the literature review. Some aspects of IDSS that fall beyond the scope of this research have been excluded, for example the user interface and report generator.
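Returning briefly to the inference engine described in Section 2.3.1, its three-phase control cycle (match, select, execute) can be sketched as a small forward-chaining loop that applies modus ponens repeatedly. This is an illustrative Python sketch only, not Marakas' design: the function name forward_chain, the "first matching rule" selection strategy and the abstract facts A to D are assumptions.

```python
def forward_chain(rules, facts):
    """Apply modus ponens repeatedly until no rule adds a new fact.

    rules: list of (premises, conclusion) pairs, read as
           'IF all premises are true THEN conclusion is true'.
    facts: iterable of facts known to be true at the start.
    """
    working_memory = set(facts)
    while True:
        # (1) Match: rules whose premises all hold and whose
        #     conclusion is not yet in working memory.
        matched = [
            (premises, conclusion)
            for premises, conclusion in rules
            if set(premises) <= working_memory and conclusion not in working_memory
        ]
        if not matched:
            return working_memory
        # (2) Select: assumed strategy, take the first matching rule.
        _, conclusion = matched[0]
        # (3) Execute: add the deduced fact to working memory.
        working_memory.add(conclusion)


# Hypothetical rule base over abstract facts.
rules = [
    (("A",), "B"),       # IF A THEN B
    (("B", "C"), "D"),   # IF B AND C THEN D
]

print(sorted(forward_chain(rules, {"A", "C"})))  # ['A', 'B', 'C', 'D']
```

Given A and C, the first cycle deduces B and the second deduces D; with C absent the loop stops after B because the second rule never matches.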
Figure 2.8: Turban and Aronson’s IDSS (2001). Components shown: Other Computer-based System; Data: external and internal; Data Management; Model Management; Knowledge Management; User Interface; DSS Manager (user).

Heath and McGregor (2004) present an expanded IDSS framework divided into five zones of interest:

1. Extraction-Transformation and Loading Frameworks
2. Data Warehouse Architectures
3. Data Mining/Knowledge Discovery in Data Frameworks
4. Knowledge Base Architecture
5. Model Base Architecture

Figure 2.9: Heath and McGregor’s five zones of interest (Heath & McGregor, 2004). Components shown: Data Source; Extract Transform Load (ETL) Process; Data Warehouse (DW) / Mart; IDSS Data Management; Domain Ontology; Knowledge Base; Model Base.

The following sections of this literature review consider each of these zones and document open research areas within each zone and across zones.

2.3.2.1 Extraction-Transformation-Loading (ETL) Frameworks

Research within academia and industry has not focused on issues within ETL (Xintao, Wu, Daniel, & Barbara, 2002). ETL involves approximately 80% of the effort required to prepare a data warehouse for decision support and data mining activities (Inmon, 2002; Xintao, Wu, Daniel, & Barbara, 2002). Despite this, ETL has received only a fraction of the attention of the research community. This is partly explained by the uncertainty and dubious nature of the 'real world' data that dominates this zone of interest. Exploration of data mining algorithms, machine learning, rule induction, neural networks, Bayesian techniques, and Model Base and Knowledge Base construction and exploitation as found in IDSS has required 'clean', clearly defined data sets. Generic activities often included in ETL are:

o Translating coded values: source systems may use I and O for inpatients and outpatients and the data warehouse may use 1 for inpatients and 2 for outpatients.
o Encoding free-form values, patient disposition may be recorded as ‘composed’, ‘calm’ or ‘relaxed’ and mapped to a numeric value of 1. o Deriving new calculated values o Light or heavy summarization of data o Addition of surrogate keys for use in DW rather than using natural keys. Some knowledge domains generate more ideal data for ETL, for example sensor data in medical or engineering domains, however many transactional datasets created by data source (DS) are not of suitable quality for inclusion in DW and subsequent IDSS (Berndt et al., 2001). Improved definition of ETL activities and provision of formal foundations for their conceptual representation is an open research issue (Vassiliadis et al., 2002). 2.3.2.2 Data Warehouse Architectures Data Warehouses have been exploited to manage persistent data, a current research challenge is the functional support of transient data in continuous data streams (McGregor, Bryan, Curry & Tracey, 2002; Qiao et al., 2003; Yu, 2004). This is of particular interest in the medical domain given the pervasive use of physiological sensor devices (McGregor et al., 2002). An increasing volume of data needed for decision-making is stored in XML data format. Researchers (Golfarelli, Rizzi & Vrdoljak, 2001) propose a semiautomatic approach for building the conceptual schema for a data mart starting from the XML sources. Current research (Hummer, Bauer & Harde, 2003) introduces XCube, a family of XML based document templates, to exchange data warehouse data to enable integration for creation of vendor independent virtual or federated data warehouses. Future research lies with the combination of XCube with the new Web Services paradigm (Hummer et al., 2003). Some research (Miquel & Tchounikine, 2002) explores pushing the boundaries of data commonly associated with data warehouses. 
These researchers propose an integrated system that can store traditional data provided by OLTP systems and raw data acquired with specialized electronic measurement devices, such as a standard electrocardiogram. In addition, they suggest a process database that stores software components to realize 'on-the-fly' data transformation. Clinical Data Warehouses pose fresh challenges to researchers, including the need for complex data modeling features, advanced temporal support, advanced classification structures, continuously valued data, dimensionally reduced data and the integration of very complex data (Pedersen & Jensen, 1998).

Domain ontologies are explicit specifications of how knowledge in a domain is conceptualized (Gruber, 1992). Elicitation of domain knowledge is a critical aspect of IDSS development. Domain ontologies are frequently associated with research directed towards the semantic Web. Domain ontologies have been included in the data warehouse architectures section of this literature review because, in Chapter 4 of this thesis, a domain ontology forms part of the recommended IDSS architecture for use in the fetal-maternal domain. In addition, the domain knowledge elicited when creating the ontology is crucial within the IDSS Knowledge Base and for use with query and reporting tools in the IDSS Data Management layer. Gahleitner, Behrendt, Palkoska and Weippi (2005) state:

Ontology building is still much more of a craft rather than an engineering discipline. Each development team usually follows its own set of principles, design criteria and phases. (Gahleitner et al., 2005)

Little research has followed on from the early recognition of a need for a systematic approach to domain knowledge elicitation. Some researchers have dismissed attempts to directly involve domain experts as both time consuming and unreliable (Jung & Budivada, 1995).
However, in 2005 a number of publications reported a research focus on the establishment of automated and semi-automated methods to build domain ontologies from existing systems and documentation (Amardeilh, Laublet & Minel, 2005; Gahleitner et al., 2005; Hayes, Reichherzer & Mehrotra, 2005; Sabou, Wroe, Goble & Mishne, 2005; Upadhyaya & Kumar, 2005). These researchers agree that the creation of high-quality ontologies is currently a very time consuming and expensive activity, and suggest that overcoming this issue is the motivation behind their research.

The work of Amardeilh et al. (2005) has focused on the semi-automatic creation of ontologies from textual documents. The semantic annotation and ontology population are dependent on the knowledge captured in the documents by the domain experts. Research addressing the use of textual documents is of interest because much fetal-maternal domain knowledge is captured in textual documents. As the approach proposed by Amardeilh et al. (2005) matures, it should be possible to employ it in the proposed fetal-maternal IDSS framework to assist in the elicitation of domain knowledge and the construction of a domain ontology.

Extracting ontologies from extended entity-relationship diagrams is under research, as described by Upadhyaya and Kumar (2005). These methods and the resulting domain ontologies have not yet gained the confidence of some user communities; of particular interest for this research is the medical community. Research is continuing on the automated creation of ontologies; of particular note is the current research into a formal approach and automated tool for translating entity-relationship schemata into the Web Ontology Language (OWL), a semantic markup language (Xu, Cao, Dong & Wenping, 2004). Related ontology research explores ontology learning and instance-data migration from databases. Two ongoing projects (Beckett, 2004; Gomez-Perez, 2004) are developing tools for creating ontologies and ontological instances from databases.
Gene ontologies have been successfully constructed and exploited by the life sciences communities to assist in providing transparent access to integrated scientific resources; of note are the Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS) project (Boucelma, Castano & Goble, 2002) and UTS/Westmead Children’s Hospital research (Kennedy et al., 2004). Similar capture of domain knowledge, in the form of an ontology, will enable greater exploitation of existing DS in other domains. Innovative future research needs to overcome the above-mentioned obstacles to ontology development to allow broader development and adoption of ontologies to facilitate all aspects of IDSS.

2.3.2.3 Knowledge Discovery in Data / Data Mining Frameworks

These are an important part of the framework and have been elaborated on in section 2.3 of this literature review.

2.3.2.4 Knowledge Base Architectures

Elicitation of domain knowledge for inclusion in IDSS Knowledge Bases is an open area of research that has not captured much attention from the research community (Boose, 1985), for reasons similar to those which hinder ontology creation. Marakas (2002a) introduces the concept of knowledge engineers, people who interview domain experts and gather the information necessary for an IDSS knowledge base. In some domains such an approach is suitable, for example collecting knowledge relating to the reasoning required to determine if a potential banking customer is suitable to take on a loan. The knowledge engineers utilize knowledge acquisition techniques such as interviewing, protocol analysis and modelling. The very complex nature of knowledge within the medical domain precludes this labour-intensive approach to knowledge acquisition. Semi-automatic methods are a current area of research interest. Rule induction from medical datasets is an ongoing, open research area (Podgorelec et al., 2002).
These researchers conclude that the suggested semi-automatic approach to knowledge discovery, with physician assessment, resulted in a good outcome. The outcome was a set of automatically induced rules which the physicians assessed as mostly correct and reliable. Physician acceptance of the Knowledge Base in any medical IDSS is a challenge that will need to be met by future researchers to ensure acceptance of IDSS (Cesnik, 2002). That research describes a gap between the knowledge repositories and the instantiation of such knowledge in IDSS.

2.3.2.5 Model Base Architectures

The model base provides the financial, forecasting, management science and other quantitative functions that provide analysis capabilities to the DSS (Turban & Aronson, 2001). Typically the model component of an IDSS is composed of the model base; the model base management system; the modelling language; the model directory; and the model execution, integration and command processor. Models for common business functions, such as allocating and controlling organisational resources, are freely available. Some models specific to the fetal-maternal domain have been developed elsewhere, such as the work of Summons et al. (1999) on fetal gestation estimation. However, the development of suitable models for the analysis required in the fetal-maternal domain remains an open research area.

2.3.3 Conclusions and Implications for this Research

The IDSS arena provides many opportunities for research, most of which are beyond the scope of this thesis. The elements of open research that will be carried forward in this thesis are those relating to the elicitation of domain knowledge for the establishment of (1) domain ontologies and (2) knowledge bases for use in fetal-maternal IDSS. The benefits of establishing a data warehouse are substantial and have influenced the framework for IDSS in the fetal-maternal domain, delivering a consolidated, transformed, clean fetal-maternal dataset.
The practical aspects of this are explored in the fetal-maternal case study in Chapter 5. A brief comment on the importance of the users of DSS has been included in this literature review. The brevity of this inclusion does not indicate a low priority being placed on users within this research. In fact, it is the acceptance and confidence of clinician users that influences the manner in which the knowledge base is established. To ensure user acceptance of any IDSS it is essential that the fetal-maternal knowledge base, and broader medical knowledge bases, be produced via a sound, scientifically based methodology, as emphasised in the Electronic Decision Support for Australia’s Health Sector Report (Australian Health Information Council, 2003).

2.4 IDSS for Clinical Practice and Research

The Generic Application Foundations and the Base Management Subsystem described in the research presented by Yu and Chien (2004) are aimed at supporting consumers during all phases of on-line purchase. The development of similar foundations and subsystems for medical IDSS presents an open research issue that has not attracted substantial attention from the research community. This may be due to the lack of financial incentives, the complex nature of medical Base Management Subsystems, and the demands of evidence-based medicine that must inform the knowledge, model and past case base components of the IDSS subsystems (Cesnik, 2002).

2.4.1 IDSS for the medical domain

Warren and Stanek (2005) consider the use of DSS in medical domains, indicating that health, like any industry, makes use of DSS for efficient organisational management. These researchers emphasise that:

Where DSS for health become special, however, is in support for clinical-decision making… A Clinical DSS provides patient-specific healthcare advice. (Warren & Stanek, 2005)

With a particular interest in the use of DSS within a medical domain, Warren and Stanek (2005) classify DSS into the following major categories:

o Data presentation/visualisation tools
    o Support the decision maker by rearranging existing data for easier assimilation.
o Problem solving by search
    o Searches an existing database using particular parameters in a search query, e.g. ‘Given drug A and drug B, search an interaction table and return all entries involving A and B’ (Warren & Stanek, 2005).
o Case based reasoning
    o Simulates reasoning by experience; attempts to match current parameters with those held in the system’s knowledge, which is embodied in a library of past cases.
o Symbolic reasoning systems
    o Typically consist of a knowledge base, an inference engine and the dynamic data currently being processed. Knowledge stored in the knowledge base is formalised, frequently as a rule, for example: If [fasting plasma glucose > 7.0 mmol/l] then [infer the Dx: Diabetes mellitus] (Warren & Stanek, 2005), where Dx is the diagnosis.
o Artificial neural networks
    o Designed to be analogous to the neurological function of the human brain. Use a training data set with known outcomes to build a structure capable of making classifications or predictions describing the problem to be decided.
o Simulation modelling tools
    o Use models to create a simplified representation of the real world to predict outcomes or explain observed behaviours.

2.4.2 Unrealized Potential for IDSS in the Medical Domain

In the April 2005 issue of the British Medical Journal, Kawamoto, Houlihan, Balas and Lobach (2005) listed 15 features of DSS used in a clinical setting and gave an example of each. The purpose of that research was to identify which of these Clinical Decision Support System (CDSS) features assisted in the provision of improved patient care.
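Looking back at the categories listed in section 2.4.1, two of them (symbolic reasoning and problem solving by search) can be illustrated with a short Python sketch. The threshold and diagnosis come from the rule quoted from Warren and Stanek (2005); everything else, including the function names, the `Inference` class and the contents of the interaction table, is hypothetical and introduced only for illustration. It is not clinical guidance.

```python
from dataclasses import dataclass

# Threshold taken from the rule quoted in section 2.4.1.
FASTING_GLUCOSE_THRESHOLD_MMOL_L = 7.0


@dataclass(frozen=True)
class Inference:
    diagnosis: str
    justification: str


def infer_from_fasting_glucose(value_mmol_l):
    """Symbolic reasoning: apply the single quoted rule, or infer nothing."""
    if value_mmol_l > FASTING_GLUCOSE_THRESHOLD_MMOL_L:
        return Inference(
            diagnosis="Diabetes mellitus",
            justification=(
                f"fasting plasma glucose {value_mmol_l} mmol/l "
                f"> {FASTING_GLUCOSE_THRESHOLD_MMOL_L} mmol/l"
            ),
        )
    return None


# Hypothetical interaction table; the drug names and notes are placeholders.
INTERACTIONS = [
    {"drugs": frozenset({"drug A", "drug B"}), "note": "placeholder note 1"},
    {"drugs": frozenset({"drug A", "drug C"}), "note": "placeholder note 2"},
    {"drugs": frozenset({"drug A", "drug B", "drug D"}), "note": "placeholder note 3"},
]


def search_interactions(table, first, second):
    """Problem solving by search: return all entries involving both drugs."""
    pair = {first, second}
    return [row for row in table if pair <= row["drugs"]]


print(infer_from_fasting_glucose(7.8))
print(infer_from_fasting_glucose(5.2))  # None
print([row["note"] for row in search_interactions(INTERACTIONS, "drug A", "drug B")])
```

Returning a justification string alongside the diagnosis is a deliberate choice in this sketch: it keeps the reasoning behind an inference visible to the clinician.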
The 15 features and examples are repeated here to enable a comparison to be drawn between what Marakas (2002a) presents as types of support that can be provided by DSS, Table 2.1, and the types of DSS support currently utilized in clinical practice (Kawamoto et al., 2005). There is a ‘gap’ between the support a DSS can potentially provide and the current manner in which DSS are utilized in the medical domain. This thesis identifies this unrealized potential and presents a methodology that assists in closing the ‘gap’, drawing clinicians beyond reliance on IDSS and DSS for monitoring tasks and towards ‘what-if’ type analysis, through knowledge bases established by sound scientific methods.

General system features

1. DSS feature: Integration with charting or order entry system to support workflow integration
   Example: Preventive care reminders attached to patient charts
2. DSS feature: Use of computer to generate the decision support
   Example: Patients overdue for ovarian cancer screening identified by querying a clinical database rather than by manual chart audits

Clinician-system interaction features

3. DSS feature: Automatic provision of decision support as part of clinician workflow
   Example: Diabetes care recommendations printed on paper form and attached to relevant patient charts by clinic support staff, so that clinicians do not need to seek out the advice of the CDSS
4. DSS feature: No need for additional clinician data entry
   Example: Electronic or manual chart audits are conducted to obtain all information necessary for determining whether a child needs immunisations
5. DSS feature: Request documentation of the reason for not following CDSS recommendations
   Example: If a clinician does not provide influenza vaccine recommended by CDSS, the clinician is asked to justify the decision with a reason such as “The patient refused” or “I disagree with the recommendation”
6. DSS feature: Provision of decision support at time and location of decision making
   Example: Preventive care recommendations provided as chart reminders during an encounter, rather than as monthly reports listing all patients in need of services
7. DSS feature: Recommendations executed by noting agreement
   Example: Computerised physician order entry system recommends peak and trough drug concentrations in response to an order for aminoglycoside, and the clinician simply clicks ‘Okay’ to order the recommended tests

Communication content features

8. DSS feature: Provision of a recommendation, not just an assessment
   Example: System recommends that the clinician prescribes antidepressants for a patient rather than simply identifying patient as being depressed
9. DSS feature: Promotion of action rather than inaction
   Example: System recommends an alternative view for an abdominal radiograph that is unlikely to be of diagnostic value, rather than recommending that the order for the radiograph be cancelled
10. DSS feature: Justification of decision support via provision of reasoning
    Example: Recommendation for diabetic foot exam justified by noting date of last exam and recommended frequency of testing
11. DSS feature: Justification of decision support via provision of research evidence
    Example: Recommendation for diabetic foot exam justified by providing data from randomised controlled trials that show benefits of conducting the exam

Auxiliary features

12. DSS feature: Local user involvement in development process
    Example: System design finalised after testing prototypes with representatives from targeted clinician user group
13. DSS feature: Provision of decision support results to patients as well as providers
    Example: As well as providing chart reminders for clinicians, CDSS generates postcards that are sent to patients to inform them of overdue preventive care services
14. DSS feature: CDSS accompanied by periodic performance feedback
    Example: Clinicians are sent emails every 2 weeks that summarise their compliance with CDSS recommendations for the care of patients with diabetes
15. DSS feature: CDSS accompanied by conventional education
    Example: Deployment of CDSS aimed at reducing unnecessary ordering of abdominal radiographs is accompanied by a ‘grand rounds’ presentation on appropriate indications for ordering such radiographs

Table 2.3: Features of Clinical Decision Support Systems (CDSS) important for CDSS effectiveness. (Kawamoto et al., 2005)

Table 2.3, above, resulted from Kawamoto et al.’s (2005) literature searches via Medline, CINAHL and the Cochrane Controlled Trials Register. Seventy studies were included in the review. The auxiliary features 12-15 inclusive are not unique to decision support systems. Features 12, 14 and 15 are ideal concepts for use in all system development and implementations. Feature 13 is also not specific to a DSS; it is simply the provision of additional outputs to patient recipients. Such a feature has great value in the medical domain but does not require specific DSS support. A number of the features include the use of diary / date minding functionality which can be provided by many generic scheduling applications; again, this does not call on specialist DSS functionality. Coupled with the date reminders are the recommended care/treatment protocols for particular conditions. A number of the features listed note the monitoring of clinicians’ adherence to the treatment protocols and highlight any variance. These features are also not calling on the specialist capabilities of DSS, as described by Marakas in Table 2.1.
(Marakas, 2002a)

The information system is storing recommended treatment protocols, such as those for diabetes and ovarian cancer preventive care, and referencing these to provide reminders such as those in features 1, 2, 3 and 5. Databases holding patient history are routinely queried to identify patients who meet particular criteria, such as those described in features 2, 10 and 14. Structured Query Language (SQL) queries such as these are readily run against relational databases without the need for a DSS.

Features 8 to 11 inclusive are moving into the DSS domain as described by many researchers (Cesnik, 2002; Mallach, 2000; Marakas, 2002a; Turban & Aronson, 2001; Warren & Stanek, 2005). Features 8 and 9 describe the DSS making at least one recommendation for action, and these features could be condensed to one feature. Features 10 and 11 are crucial features, unique to a clinical DSS, and cannot be omitted. Given the nature of evidence-based medicine, these features are essential to enable the clinician to ‘know’ what knowledge the CDSS is using in creating a recommendation. The need for such easy access to evidence was highlighted by Cesnik (2002) in mapping a future direction for DSS in Australia, to assist in building clinicians’ confidence in, and acceptance of, CDSS.

Discussions at the Medical DSS in the 21st Century conference in Sydney, October 2003, also indicated that clinicians were thinking of DSS as a tool to assist in the prevention of incorrect medication usage and to provide reminders for treatment protocols, rather than as a more sophisticated tool for multi-dimensional analysis, ‘what-if’ scenario analysis and the generation of multiple options. Comparing Table 2.1 with the features in Table 2.3, it is possible to identify the open areas in DSS within a medical domain that have not yet been widely explored.
These include:

o Exploring multiple perspectives of a decision context
o Generating multiple and higher quality alternatives for consideration
o Exploring and testing multiple problem-solving strategies
o Facilitating brainstorming and other creative problem-solving techniques
o Exploring multiple scenarios for a given decision context

2.4.3 DxPlain – a medical IDSS from the United States of America

The DxPlain (DXPlain, 2006) IDSS, developed at the Laboratory of Computer Science at Massachusetts General Hospital, uses a modified form of Bayesian logic to derive clinical interpretations. DxPlain accepts a set of clinical findings, including signs, symptoms and laboratory data, for a specific patient. DxPlain then produces a ranked list of diagnoses that may explain or be associated with the clinical manifestations. Of particular importance, given the comments regarding Features 10 and 11 in the work of Kawamoto et al. (2005) above, DxPlain provides justification for why each of the diseases might be considered and suggests what additional clinical information could be useful to collect. Unusual or atypical clinical manifestations are also listed for specific diseases.

The knowledge base (KB) component of the DxPlain IDSS includes over 2200 diseases and in excess of 4900 clinical findings, i.e. symptoms, signs, epidemiologic data, and laboratory, endoscopic and radiologic findings. The average disease description includes 53 findings. Each disease/finding pair has two numbers describing the relationship. The first indicates the frequency with which the finding occurs in the disease, and the second the degree to which the presence of the finding suggests the possibility of the disease. There are more than 230,000 individual data points in the knowledge base depicting disease/finding relationships. An additional value in the range 1-5 indicates how important it is to explain the presence of each finding. This value is independent of any particular disease.
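The disease/finding relationships described above suggest a simple data structure, sketched here in Python. This is an illustration of the stored values only: the 1-5 scale assumed for the two pair values, the disease and finding names, and in particular the scoring rule in `rank_diseases` are assumptions made for this example. DxPlain's own modified Bayesian logic is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """The two numbers stored for one disease/finding pair."""
    frequency: int  # how often the finding occurs in the disease (assumed 1-5 scale)
    evoking: int    # how strongly the finding suggests the disease (assumed 1-5 scale)


# Importance of explaining each finding (1-5), independent of any disease.
FINDING_IMPORTANCE = {"finding a": 5, "finding b": 2, "finding c": 3}

# Hypothetical knowledge base entries keyed by (disease, finding).
LINKS = {
    ("disease X", "finding a"): Link(frequency=4, evoking=3),
    ("disease X", "finding b"): Link(frequency=2, evoking=1),
    ("disease Y", "finding a"): Link(frequency=1, evoking=1),
    ("disease Y", "finding c"): Link(frequency=5, evoking=4),
}


def rank_diseases(present_findings):
    """Return (disease, score, supporting findings), highest score first."""
    scores, support = {}, {}
    for (disease, finding), link in LINKS.items():
        if finding in present_findings:
            # Assumed scoring: evoking strength weighted by finding importance.
            # The frequency value is stored but not used in this simple score.
            weight = link.evoking * FINDING_IMPORTANCE[finding]
            scores[disease] = scores.get(disease, 0) + weight
            support.setdefault(disease, []).append(finding)
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return [(d, scores[d], sorted(support[d])) for d in ranked]


print(rank_diseases({"finding a", "finding c"}))
# [('disease Y', 17, ['finding a', 'finding c']), ('disease X', 15, ['finding a'])]
```

Returning the supporting findings with each ranked disease mirrors the justification role discussed for Features 10 and 11: the clinician can see which findings produced each suggestion.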
Each disease has two associated values: one indicates its prevalence as very common, common, rare or very rare; the other indicates its importance, ranked between 1 and 5, and attempts to reflect the impact of not considering the disease if it is present.

DxPlain grew from a stand-alone application with a knowledge base of 500 diseases into a web-based version which can be subscribed to by hospitals, medical schools and healthcare organisations. Many years of work have been involved in the establishment of this DSS. An extended CRISP-DM applied over historical patient data would allow for rapid generation of disease, treatment and clinical findings from SWSAHS fetal-maternal data sets to populate a similar knowledge base, using the innovative methodology presented in this thesis. DxPlain is considered here because it contains a large dataset which has been refined over many years by clinicians. DxPlain presents as a good example of a medical IDSS which has been persistently used by an ever widening group of clinicians.

2.4.4 Medical IDSS in the Australian Context

In the Australian context, the Australian Health Information Council (AHIC) provides advice to Health Ministers on how information management and information and communication technology (IM&ICT) effort can be harnessed to address current and emerging needs in health care delivery, management and planning. This council recognised the need to research the use of DSS in the Australian health sector and formed the Electronic Decision Support (EDSS) sub group. This group defines Electronic Decision Support as:

Electronic Decision Support is access to knowledge stored electronically to aid patients, carers, and service providers in making decisions on health care.
In November 2002 the findings of their research were published in a national report, Electronic Decision Support for Australia’s Health Sector (Australian Health Information Council, 2003), and the Electronic Decision Support Steering Committee manages and gives direction to the recommendations made in this report. This report considers the following:

o A definition of electronic decision support
o Status of electronic decision support system implementation
o Evidence for benefits of using electronic decision support systems
o The needs of clinicians and the health information industry
o Barriers to successful implementation of EDSS

Of particular interest to my research are the barriers to successful implementation of EDSS. These barriers are generalised and listed as:

1. Concerns about quality and safety aspects of the systems
2. Gaining the acceptance of health professionals
3. Implementation issues
4. Level of investment required

From barrier (1) it is clear that one of the main areas of concern is the content of the underlying knowledge base used in the EDSS: specifically, whether knowledge bases have been translated accurately into electronic form, whether they are based on medical evidence, whether they are peer reviewed and whether there have been trials to test the ‘rules’. Barrier (2) again raises confidence in the knowledge base as a possible barrier to clinician acceptance of EDSS. To quote the report:

They (clinicians) expect that the knowledge within such systems must match that of the most trusted experts within each area of clinical practice. They require accurate translation of the knowledge base into an electronic format. Clinicians argue that many systems are limited because of the quality of the data entered and the failure to reflect local patient mix and practice patterns.
(Australian Health Information Council, 2003)

Discussions with my research partners at the Liverpool fetal-maternal unit have reiterated these particular concerns. Elicitation of domain knowledge for inclusion in a knowledge base is an open research area. Inference of fetal-maternal rules from the extensive data set held at Liverpool helps to address the problem highlighted above regarding failure to reflect local patient mix. It is interesting to note that in a strict scientific research approach the limitation of inferred rules to only a subset of the broader population could be considered a weakness, yet clinicians are calling for a local focus, as described in the Australian national report above. All of these issues influence the conduct of my research; however, barriers (1) and (2) have been given particular consideration, and the proposed methodology described in Chapter 4 of this thesis is motivated by a desire to mitigate the negative impacts of clinicians’ concerns about quality, safety and overall acceptance by users of the fetal-maternal IDSS.

There is also a need for guidelines to assist novices in evaluating EDSS. These guidelines have been produced by The Centre for Health Informatics Research at the University of New South Wales (Australian Health Information Council, 2003). These guidelines describe the knowledge content of an EDSS as comprising two components, the Inference Engine and the Knowledge Base, as illustrated below and referred to previously in this thesis in section 2.3.2.4.

Figure 2.10: A general model of an EDSS

The guidelines continue, explaining that the content of the knowledge base is a specific representation of the knowledge and recommendations in a particular clinical area, e.g. recommending treatment for thyroid disorder or a type of cancer. The inclusion of medical treatment protocols/guidelines is considered necessary in the knowledge base.
The knowledge base contents are generally unreadable by humans and stored in a format understood by the inference engine. The inference engine draws logical conclusions using a particular method of reasoning. These guidelines also emphasise the importance of using a reliable and valid knowledge source to ensure compliance with evidence-based best practice, leading to improved quality, safety and consistency of care.

These EDSS guidelines published by the UNSW Centre for Health Informatics Research emphasise that conversion of the knowledge source into executable content should be appropriately supervised. This is to ensure that no medical domain knowledge is lost or accidentally changed in any way. When multiple knowledge sources have been merged into a single knowledge source, i.e. the knowledge base within the EDSS, there is a possibility that the different sources will make conflicting recommendations regarding patient treatments/protocols. Carefully considered procedures must be in place during conversion of the knowledge source into executable content to minimize the impact of the adverse factors described above. Conversion of the medical knowledge source into executable content, for use in the inference engine, should conform to a standardised methodology. These UNSW researchers acknowledge that there are few applicable standards in the area and encourage the use of a documented and auditable methodology. Establishment of such a standard methodology is clearly an important open research area.

My thesis pursues the development of an evidence-based process for deriving knowledge source rules from the fetal-maternal domain for use by the inference engine in the knowledge base component of the proposed IDSS for fetal-maternal clinical practice and research.
Chapter 4 of this thesis describes my proposed methodology, which answers the UNSW researchers’ call for adequate clinical input and review and for the use of a documented, auditable methodology during knowledge conversion into executable format.

2.4.5 Conclusions and Implications for this Research

The review conducted by Kawamoto et al. (2005) indicates that inclusion of IDSS within clinical workflows is important for efficiency and system acceptance. IDSS such as DxPlain are not integrated within the clinician’s workflow; they are separate reference-type systems that are not pervasive and immediately available for use with the protocol monitoring and schedule monitoring favoured in Kawamoto’s features. Integrating these currently disparate components of IDSS is an open research area. The IDSS framework described in this research provides an opportunity to improve this situation by integrating patient data from online transaction processing systems with a knowledge base (a similar concept to DxPlain) to facilitate clinical research and practice.

2.5 EXPERIMENT DESIGN AND CLINICAL TRIALS

Gross-Portney and Watkins (2000) describe the ultimate purpose of health professionals to be the development of a knowledge base that will maximise the effectiveness of health practice. Evidence-based medicine is a fundamental principle underlying health care which aims to ensure that choices regarding patients are made based on evidence that has been confirmed by sound scientific data. The British Medical Journal (Sackett, Rosenberg, Muir Gray, Haynes & Scott-Richardson, 1996) defines evidence-based medicine as:

The conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients.
Efforts to establish medical evidence include Clinical Research, defined by Gross-Portney and Watkins (2000):

Clinical Research is a structured process of investigating facts and theories and exploring connections … examining clinical conditions and outcomes to establish relationships among clinical phenomena, to generate evidence for decision making and to provide the impetus for improving methods of practice.

The Scientific Method guides Clinical Research and is illustrated by Anderson (2006) in the following schematic:

Figure 2.11: The Scientific Method (Anderson, 2006)

The principle of the scientific method involves a cycle of observation-hypothesis-experiment. Carey (1994) offers a brief description:

In a nutshell, then, we can say that scientific method is a rigorous process whereby new ideas about how some part of the natural world works are put to the test (Carey, 1994)

The following is a brief description of the Scientific Method, bringing together concepts and explanations from a variety of sources (Anderson, 2006; Carey, 1994; Corsi & Weindling, 1983). I acknowledge that debate continues in the scientific community regarding the scientific method, philosophy, theology and sociology. I also acknowledge that there is not one step-by-step recipe (Carey, 1994) followed by scientists in their research. However, Figure 2.11 does represent a broadly held positivist view of the scientific method which is appropriate for this research.

The scientific method begins with observations of phenomena. To investigate further, a testable hypothesis is developed in an attempt to explain the observations. The veracity of this hypothesis is tested via further experimentation or observations. If the hypothesis cannot be verified to an acceptable level of ‘certainty’, an iterative action follows with a refinement of the hypothesis and retesting.
If many tests of the hypothesis are repeatable, and found to support the hypothesis, the importance of the work can be raised to the level of a widely held theory. The potential exists for the theory to then face scrutiny and progress to the status of a ‘law’.

The purpose of experimental design is to provide a mechanism for assessing the cause-and-effect relationship between a set of dependent and independent variables. Experimental design can be depicted as existing on a continuum as illustrated below:

[Figure 2.12: Continuum of Experiment Design (Gross-Portney & Watkins, 2000). The continuum runs from DESCRIPTIVE (describe populations) through EXPLORATORY (find relationships) to EXPERIMENTAL (cause and effect).]

A clinical trial is the most common type of experimental design in epidemiology and provides the strongest evidence for cause and effect. Clinical trials are prospective studies comparing the effect of an intervention against a control. Piantadosi (1997) presents: a clinical trial is an experiment testing medical treatments on human subjects. Randomized clinical trials are regarded as the ideal experiment design.

Gross-Portney and Watkins (2000) describe clinical research experimental designs including the Non-equivalent post test-only control group design. This is a quasi-experimental design that uses existing patient groups. Investigative groups can be drawn from existing clinical patient data. This experiment design is particularly useful when ethical concerns preclude a true control group or when investigating a rare condition or event.

Patients treated in a fetal-maternal domain generally present with unusual conditions not often found in the broader patient group. Establishment of a clinical trial surrounding these unusual cases may not be possible – even if a multi-centre approach is adopted. The difficulty in using the randomized method for rare conditions relates to the need for a large enough ‘n’ to provide the power to find statistically significant differences between study groups.
This is not possible if there are few cases to include in a study.

The research presented by Lord, Genski and Keech (2004) discusses the issues of using clinical trial data to support other research investigations. As described above, the establishment of a clinical trial is not always possible. An emerging open research area involves addressing the challenge of conducting meaningful, sound data analysis on existing data collected during patient treatment. The research presented by Lord et al. (2004) does not consider the foundation scientific method but rather assumes that clinical trials are the only vehicle for such data analysis.

Gross-Portney and Watkins (2000) state that the Historical Research approach is weak:

“Because sources, measurements and organisation of data are not controlled, cause-and-effect statements cannot be made in historical research.” (Gross-Portney & Watkins, 2000)

The application of data mining techniques in the analysis of historical patient data may provide an opportunity to move this experimental design from quasi status towards a stronger cause-and-effect experiment design. Some research has combined data mining with retrospective studies of patient data in the search for prognostic markers, including Ji, Naguib and Ghoneim (2003), Goodwin et al. (2000; 2001) and Hagland (2004), who reports the work of Dr Eric Bremer in pediatric brain tumor research. The extensions to CRISP-DM, as described in Chapter 4, facilitate such an innovative approach.

It should be noted that the use of IDSS and KDD in healthcare organisations has largely focussed on organisational issues (Alexandrini et al., 2003; Rao et al., 2003) rather than exploration of patient historical data records. I see this as another example of IDSS and KDD being used to improve the commercial position of these organisations rather than advancing medical knowledge.
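The sample-size difficulty noted earlier in this section can be made concrete with a small calculation. The following is a minimal sketch and not part of the original study: it assumes the standard normal-approximation formula for comparing two proportions, and the function name, the default significance level (0.05) and the default power (0.80) are illustrative assumptions.

```python
import math
from statistics import NormalDist


def patients_per_group(p_control: float, p_treated: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate group size for a two-sided comparison of two proportions.

    Illustrative helper (hypothetical name). It uses the usual
    normal-approximation formula, so the result is only a rough guide.
    """
    if p_control == p_treated:
        raise ValueError("the two outcome rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    p_bar = (p_control + p_treated) / 2            # pooled rate under the null
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p_control * (1 - p_control)
                                       + p_treated * (1 - p_treated))) ** 2
    return math.ceil(numerator / (p_control - p_treated) ** 2)
```

Calling patients_per_group(0.10, 0.20) asks how many patients each group would need in order to detect a change in outcome rate from 10% to 20% (hypothetical rates). Narrowing the difference to be detected increases the required group size sharply, which is the practical obstacle when only a handful of cases of a rare condition are available.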
2.5.1 Dominant Medical Research Paradigm

The null hypothesis (from the Latin nullus, meaning ‘not any’) states that there is no difference between the comparison groups and is usually established to be disproved rather than proved. In a clinical setting the null hypothesis expects no clinical effect or difference beyond chance difference (Graziano & Raulin, 2003; Piantadosi, 1997).

“If a statistically significant difference is found then the null hypothesis must be rejected. If the differences are within chance limits, the null hypothesis is NOT rejected.” (Graziano & Raulin, 2003)

Rejecting the null hypothesis is not sufficient to draw a causal inference between variables, as factors other than the independent variable may have an effect on the dependent variable under investigation. Such factors are known as confounding variables.

Roddick (2003) refers to the impact the null hypothesis paradigm has on the typical processes adopted in knowledge discovery in data (KDD) and indicates this is an open area for research. The enhancement of CRISP-DM, respecting the needs of evidence-based medicine and the strength of the null hypothesis paradigm, underpins the new process described in Chapter 4.

2.5.2 Clinical Reasoning, Statistical Reasoning and KDD

Clinical and statistical reasoning converge in clinical research (Piantadosi, 1997). Interestingly, Mathews (1995) describes resistance to the use of numerical comparison and statistical methods in evaluating therapeutic efficacy. In clinical research, empirical knowledge comes from observations and data, and theory-based knowledge comes from established biology or hypothesis. Similarly, statistics uses empirical knowledge drawn from data or observations and theory-based knowledge of probability and determinism which has been formalized in mathematical models (Piantadosi, 1997).
Well established statistical techniques exploited in clinical research include regression, generalized linear models, regression trees, time series and analysis of variance (Han & Kamber, 2001).

Figure 2.13: Convergence of Clinical and Statistical Reasoning Paradigm

The KDD process, including the data mining sub-process, has been widely researched and documented (Han & Kamber, 2001; Inmon, 2002; Mackinnon & Glick, 1999; Marakas, 2002b; Masuda & Sakamoto, 2002; McCarthy, 2000; Roddick et al., 2003; Roiger & Geatz, 2003). Extending the convergence of clinical research and statistics paradigm described above, KDD is an additional, converging dimension that will increase the power of clinical research that exploits historical data. This convergence is introduced and explored throughout this thesis. This is an open research area that remains unaddressed by other researchers.

Figure 2.14: Addition of KDD to Reasoning Paradigm

2.5.3 Conclusions and Implications for Research

KDD and data mining enhance the traditional statistical techniques by bringing the theoretical foundations and techniques of data mining, including machine learning, neural networks, association mining and clustering. Application of these principles to Gross-Portney and Watkins’ (2000) Nonequivalent post test-only control group experiment design has not been widely documented or researched and presents an open research area. In addition, modification of established KDD methodologies to support the null hypothesis in clinical research is an open KDD research area that is addressed in this thesis via the extended CRISP-DM methodology to be employed in the data management component of IDSS.

CHAPTER 3 CROSS INDUSTRY STANDARD PROCESS – DATA MINING (CRISP-DM) SPECIALIZED TASKS FOR MEDICAL RESEARCH

3.1 Introduction

This Chapter addresses Hypothesis 1 of this thesis by demonstrating that CRISP-DM can be extended to enable its use in medical research driven by the null hypothesis paradigm.
Clinical trials are the currently favoured approach to sound medical research. These trials involve control groups and intervention groups, and all associated factors are strictly controlled and monitored. Clinical trials are formulated and conducted in such a way as to reject a particular null hypothesis, thus supporting a particular alternate hypothesis. This design is considered to be a strong experimental design from which cause-and-effect relationships between factors can be stated.

The patients treated in a fetal-maternal environment present with multiple complex, rare conditions. The construction of a rigorous clinical trial – even a multi-centre clinical trial – is more difficult to organise than, say, a clinical trial for the study of lung cancer patients, due to the small number of patients presenting with these fetal-maternal conditions. This leads us to the interpretation of existing clinical data to support hypotheses held in the fetal-maternal environment.

Retrospective, historical studies have traditionally not held up to the rigorous analysis possible through a prospective clinical trial. Such studies are considered quasi-experimental designs. This thesis proposes that by adopting an improved CRISP-DM that specifically addresses the requirements of null hypothesis analysis, it is possible for a retrospective, historical analysis to move from the arena of quasi-experimental design towards the well regarded true experimental design.

The knowledge thus generated, in an electronic format, can be used to populate the knowledge component in an IDSS when a rule-generating data mining technique is chosen in the modelling phase of CRISP-DM.

3.2 Extended CRISP-DM enhancing Data Management Layer of IDSS

This thesis proposes extensions to CRISP-DM. These new concepts are to be applied within the Data Management Layer of the proposed IDSS architecture, as highlighted below.
[Figure 3.1: Framework for IDSS, data management highlight. Labels in the figure: ETL Frameworks; Data Source (three instances); Extract Transform Load (ETL) Process; Data Warehouse Architectures; IDSS Data Management; Data Warehouse (DW) / Mart; Domain Ontology; Data Mining / Knowledge Discovery in Data Frameworks; Knowledge Base; Knowledge Base Architecture; Model Base; Model Base Architecture]

3.3 ‘Outputs’ from Data Mining are ‘Inputs’ to the Knowledge Base of IDSS

The domain knowledge generated by KDD activities is applied in the Knowledge Base component of the IDSS Architecture, as highlighted below.

[Figure 3.2: Framework for IDSS, knowledge base highlight. The figure carries the same labels as Figure 3.1.]

3.4 CRISP-DM Specialised Tasks to Support Medical Research

The literature review section of this thesis introduced both the Scientific Method and CRISP-DM. I propose in this thesis that the Scientific Method’s principle of observation-hypothesis-experiment rests comfortably with CRISP-DM, with discovery data mining supporting the observation-hypothesis phase and confirmatory data mining supporting the hypothesis-experiment phase. The following diagram illustrates the parallelism between the Scientific Method and CRISP-DM, which is very applicable in the fetal-maternal (and other) medical domain. Other researchers have not explored such parallelism previously and this is a unique aspect of this thesis. The switch from exploratory data mining to confirmatory data mining is discussed in the remainder of this section.

[Figure 3.3: Parallelism between CRISP-DM and the Scientific Method]

The diagram above illustrates the convergence I perceive in the various concepts of CRISP-DM, exploratory and confirmatory data mining and the scientific method. This comparison has not been drawn in existing research. I have grouped the CRISP-DM and scientific method elements into five common phases, indicated the iterative nature of each and exposed the scope for use of exploratory and confirmatory data mining. The null hypothesis medical research paradigm, as discussed in the literature review, must be the driver behind the confirmatory data mining and associated analysis for the scientific method.

3.4.1 Initial Phase

The initial phase of CRISP-DM involves developing an understanding of the business/organisational operation for the domain to be investigated. Once an overall appreciation of the business/organisation is established, a more detailed analysis of the organisational data is undertaken. There is no set timeframe specified for the initial phase of CRISP-DM. Organisational staff may, over a couple of years, months or weeks, observe trends or exceptional circumstances in business operation that generate particular datasets. Alternatively, data analysts may be invited to use exploratory data mining techniques to quickly establish some overall business understanding and more detailed data understanding. Both of these approaches identify aspects of organisation operation, from production to sales, which could benefit from further analysis.

The initial phase of the scientific method begins with an observation of phenomena. As with the initial phase of CRISP-DM, there is no established timeframe for this initial phase. Clinicians may, over a period of years, months or weeks, observe trends or exceptional circumstances regarding patients that generate particular datasets. From these observations a hypothesis may be formulated to explain the observations.
Inviting data analysts to use exploratory data mining techniques to establish some overall appreciation of computerised patient data is, as yet, not a widely accepted element of the initial phase of the scientific method. As noted in the literature review of this thesis, the nature of medical data is challenging due to the demands of: working with a considerable knowledge base; data availability; and clinical data accuracy problems. It is in this initial phase and the following data preparation phase that such issues must be dealt with in using the scientific method for clinical research. It is in this initial phase that the clinicians formulate an alternate hypothesis and generate an associated null hypothesis, which will be used in the testing phase once the intermediate data preparation phase is complete.

Exploratory data mining takes many variables/attributes/factors into consideration, using a variety of techniques in search of systematic patterns. The results and outcomes from the exploratory data mining are weak until they are confirmed in the confirmatory data mining phase of the methodology illustrated in Figure 3.3 above. Exploratory data mining approaches provide essential information for consideration in the initial and data preparation phases for both CRISP-DM and the scientific method. Exploratory data mining techniques are not used for null hypothesis testing – this is done using confirmatory data mining techniques in the testing phase.

3.4.2 Data Preparation Phase

The data preparation phase of CRISP-DM aims to prepare the datasets to be used for modelling or the major analysis of the project. Attributes and rows are selected from source database management systems. Not all data contained within a business system are required – the scope is determined by the business phenomena to be investigated.
As noted in the literature review of this thesis, data preparation is an expensive, time-consuming aspect of KDD, and targeting particular data elements increases the efficiency of this phase of CRISP-DM. The data quality is raised to the level required by the analysis techniques to be used in the modelling within the following testing phase. This may involve cleaning of data, inserting suitable default values or estimating missing data, transforming data, integrating data and syntactic modifications.

The data preparation phase for the scientific method varies depending on the type of study to be undertaken, viz. randomised, prospective, correlation or retrospective. This thesis focuses on using existing patient data; therefore the activities in the data preparation phase for correlation and retrospective studies are very similar to those described above for CRISP-DM data preparation. However, it is noted that ethical issues relating to patient confidentiality and privacy add a level of complexity to the data preparation phase for scientific method application in a medical domain.

3.4.3 Testing Phase

The testing phase for CRISP-DM involves modelling using a specific modelling technique such as C4.5, neural network creation or decision tree building. These models are not generated in an ad hoc manner; they are selected and applied to the prepared datasets in response to the targeted area of business operation identified in the initial phase. The model’s quality and validity are tested prior to application across the broader dataset. The prepared organisational data is split into a training set and a test set. The model is built using the training set and the quality of the model is estimated using the test set.

In the scientific domain, traditional hypothesis testing aims to verify a priori hypotheses about relations between variables/attributes/factors.
An example of such a hypothesis may be “There is a positive correlation between maternal age and occurrence of Down syndrome in newborns”. Exploratory data mining is used to identify relationships between variables/attributes/factors when there are incomplete or non-existent a priori expectations. This is conducted in the initial phase of the methodology presented in Figure 3.3. Ideally the exploratory data mining results should be cross-validated using a different data set or an independent subset of the original patient data drawn from the data warehouse. This is done in the testing phase. In addition to testing the null hypothesis, it is also important to test the predictive validity of any association rules identified in the exploratory data mining phase.

3.4.4 Assessment Phase

Figure 3.3 has the traditional CRISP-DM evaluation step contained in the assessment phase. CRISP-DM evaluation involves assessment of data mining results, checking for pertinence to the initial business KDD objectives. CRISP-DM calls for an evaluation of the DM outputs to determine the next steps: does the process return to the initial phase and amend the KDD scope for further iterations, or proceed to deployment within the usage phase? This is the crucial phase for assessing the strength of the null hypothesis. The CRISP-DM evaluation step must be expanded in this assessment phase for medical domain applications. The CRISP-DM guidelines are, as intended, generic in nature with usefulness across most domains. For use in the fetal-maternal domain, and the medical domain more broadly, this assessment phase must be strengthened with the traditional null hypothesis evaluation regarding statistical significance and data bias. Confirmatory data mining techniques and traditional statistical techniques support this assessment phase for the scientific method.
If the null hypothesis cannot be disproved then the alternate hypothesis cannot be supported and, as indicated in Figure 3.3, a return to hypothesis formulation is required. This mirrors the evaluation of data mining models in CRISP-DM forcing a return to the initial phase if needed.

3.4.5 Usage Phase

CRISP-DM deployment involves using positive evaluation results and deploying, monitoring and maintaining data mining results in the business workplace.

The usage phase for the medical domain is more complex, as the demands of evidence-based medicine practice require ratification of results prior to use in clinical practice. It is the demands of this ratification – starting with the investigative method – that acted as a catalyst for this research.

3.4.6 The role of exploratory and confirmatory data mining

The tools used for exploratory data mining can also be used in the confirmatory data mining phase; however, the manner in which they are employed is quite different. For example, the C5.0 algorithm can be used in the exploratory phase and in the confirmatory phase. In the exploratory phase the intuition and experience of the clinician guides the selection of attributes from the data warehouse for use in the data management layer where C5.0 is executed. The clinician has insight into the patient data set and thus can reduce thousands of possible attributes for use in C5.0 down to a more manageable quantity. Recall that one of the characteristics of medical patient data is the large number of dimensions, i.e. potential attributes/factors for consideration. Some researchers interested in the scientific methodology may say that the use of clinicians to ‘whittle’ down the factors to be considered by C5.0 is an abuse of the method. Other researchers believe that close co-operation between the clinician and data analyst facilitates efficient CRISP-DM operation.
Despite the validity of both these opinions, I believe that it is necessary for the data analyst to at least confer with the clinician regarding factors to target in initial exploratory investigations. In an ideal world, with unlimited resources, we could run C5.0 across all permutations and combinations of data attributes/factors; in reality, however, this opportunity is unlikely to arise.

A generic example is created here to illustrate. A clinical data warehouse may contain 1000 attributes, A1, A2, A3 … A1000. An initial clinician review of these attributes may highlight a subset of attributes to be investigated using exploratory data mining techniques; call this subset B1, B2, B3 … B200. Identification of patterns, clusters and previously unrealised relationships between 200 factors is a computational task beyond human ability; therefore the use of exploratory data mining techniques remains valid. Continuing with the example, currently available data mining tools, such as Clementine 8.0, facilitate the import of the 200 factors from the data warehouse. Using the filtering function available in Clementine 8.0 it is possible to readily ‘explore’ the selected data set. It is important to have a data mining tool suited to rapid iteration through exploratory data mining processes; rapid display of results is essential. Consideration of the relative efficiencies of various data mining algorithms is outside the scope of this thesis.

Through the exploration process the C5.0 algorithm, or another suitable algorithm of choice, provides valuable insights into the data set. Close cooperation must be maintained between the data analyst and clinicians at this iterative, exploratory stage of the data mining methodology. Clinicians can expect to see predictable relationships identified by the data mining algorithm.
Continuing with the generic example, the C5.0 algorithm may return the following type of output:

…
B5 = X [Mode: J] (500)
B78 = Y [Mode: K] (300)
B196 > 25 (210)
B196 <= 39 [Mode: J] => P (90, 0.634)
B196 > 39 [Mode: J] (100)

Visualisation is another data mining tool that is useful during the exploratory data mining phase of the new methodology. The output from the C5.0 algorithm is not readily interpreted by the clinicians. The advantages of data visualisation are well documented and this tool is well suited to exploring large patient data sets.

If the methodology is being effectively utilized, during the exploratory data mining phase the clinicians should formulate some hypotheses that they wish to consider more closely. This is the stage when focus shifts from exploratory data mining to confirmatory data mining, as illustrated in Figure 3.3 above; hypotheses are to be tested with confirmatory techniques.

As previously stated, the data mining tools used in exploratory work may also be used in the confirmatory work; however, the manner in which they are employed is different. In the free-ranging exploratory phase C5.0 was used to randomly draw together various factors for consideration. The start of the confirmatory phase occurs when clinicians formulate an alternative hypothesis and move to a null hypothesis, as discussed in the literature review of this thesis (refer section 2.6.1). With the clear statement of a null hypothesis we must start with a fresh set of factors for consideration in the C5.0 algorithm. Any factors/attributes not related to the stated null hypothesis must be removed from consideration. We are left with the factors/attributes of concern to the null hypothesis. In this confirmatory phase traditional statistical techniques are combined with the data mining algorithms to measure the ‘strength’ of the outputs from the data analysis.
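One way to measure the ‘strength’ of an exploratory rule in the confirmatory phase is to test it against a null hypothesis of no association between the rule condition and the outcome. The sketch below is an illustration only: it assumes a two-sided Fisher exact test on a 2x2 table, which this thesis does not prescribe, and the function name, the significance level and all counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: rule condition met / not met. Columns: outcome present / absent.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_probability(x: int) -> float:
        # Hypergeometric probability of x cases in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one
    return sum(p for p in map(table_probability, range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# Hypothetical counts from a confirmatory subset not used during exploration
p_value = fisher_exact_two_sided(57, 33, 40, 60)
reject_null = p_value < 0.05  # the 0.05 threshold is an assumption
```

An exact test is used in this sketch because cell counts for rare conditions can be small; with larger counts, a chi-square test would be a common alternative.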
It is at this point that we can become engrossed in the issues surrounding the generation of clean data sets prepared specifically for analysis purposes. Research abounds regarding the necessary quality demanded of patient datasets to be used for confirmatory analyses. Section 2.6 of the literature review in this thesis addresses these matters and identifies this as a potential open research area. The innovative methodology I present here acknowledges these demands and proposes that by adopting an improved CRISP-DM that specifically addresses the requirements of null hypothesis analysis, it is possible for a retrospective, historical analysis to move from the arena of quasi-experimental design towards the well regarded true experimental design.

The following figure illustrates the additional layers added to CRISP-DM to accommodate the needs of Clinical Practice and Research.

[Figure 3.4. Text recoverable from the figure: “New Process Extensions For Medical Data Mining, as proposed by this thesis”]

The Data Mining Rule Set Generation Layer is an exploratory data mining layer. The rules generated from this layer are reviewed by a suitable Clinician and Significant Rule Sets are Selected for further analysis. The Clinician and data analyst work together to Formulate an Appropriate Null Hypothesis, which is carried forward into the confirmatory data mining processes in the Run Statistical Process to Test Null Hypothesis Layer. Ideally this would be conducted on a separate subset of confirmatory data to generate a statistically sound outcome, e.g. meaningful p-values. Following Clinician review and consideration of the outcome from confirmatory data mining, appropriate rule sets are loaded into the IDSS Knowledge Base.

3.5 Generation of Electronic ‘Rules’ for use in IDSS knowledge base

Clinicians have absolute control over the knowledge base rules that are added to the electronic knowledge base.
These Clinicians have the improved retrospective study, including use of the null hypothesis, to support the validity of generated rules. The elicitation of domain knowledge, in electronic format, for use in knowledge bases has been investigated by other researchers (Bench-Capon, Coenen, Nwana, Paton & Shave, 1993; Bench-Capon & Visser, 1997). These researchers state that if the conceptualisation is explicit, the knowledge engineer has a framework to guide the acquisition of domain knowledge.

The MEKAS knowledge acquisition methodology presented by Bench-Capon et al. (1993) attempted to address the need for a systematic approach to domain knowledge elicitation by including an early stage where the domain is conceptualised. Little research has followed on from this early recognition of the need for a systematic approach to domain knowledge elicitation. Researchers including Jung and Gudivada (1995) dismiss attempts to directly involve experts and move on to automated methods, which have not yet gained the confidence of the broad medical community:

“Elicitation of medical knowledge required for determining relationships, directly from human experts, is both time consuming and unreliable.” (Jung & Gudivada, 1995)

The approach recommended in this thesis is an improvement on earlier elicitation methods because it begins with the accepted medical approach of stating an alternate hypothesis followed by a null hypothesis, and proceeds to rejection or acceptance of the hypothesis. Ultimately, when appropriate, the methodology moves to establish electronic rule-sets for clinician review prior to populating the IDSS knowledge base with the generated rules.
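The layered flow described in this chapter (rule generation, clinician selection, null hypothesis testing, clinician review, loading into the knowledge base) can be outlined in code. This is a minimal sketch under stated assumptions, not an implementation from this thesis: the class, the function and its parameters are hypothetical names, the clinician decisions are represented as simple callables, and the 0.05 threshold is an assumed default.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class CandidateRule:
    description: str                 # rule text from exploratory data mining
    null_hypothesis: str = ""        # formulated by clinician and data analyst
    p_value: Optional[float] = None  # filled in by the confirmatory test


def build_knowledge_base(
    candidates: Iterable[CandidateRule],
    clinician_selects: Callable[[CandidateRule], bool],
    test_null_hypothesis: Callable[[CandidateRule], float],
    clinician_approves: Callable[[CandidateRule], bool],
    alpha: float = 0.05,
) -> List[CandidateRule]:
    """Return only the rules that pass every review and testing layer."""
    knowledge_base: List[CandidateRule] = []
    for rule in candidates:
        if not clinician_selects(rule):            # significant rule sets are selected
            continue
        rule.p_value = test_null_hypothesis(rule)  # run on separate confirmatory data
        if rule.p_value >= alpha:                  # null hypothesis not rejected
            continue
        if clinician_approves(rule):               # final clinician review
            knowledge_base.append(rule)
    return knowledge_base
```

The clinician callables appear both before and after the statistical test, reflecting the requirement that clinicians keep control over what enters the knowledge base; a rule that is never selected is never tested.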
CHAPTER 4 INTELLIGENT DECISION SUPPORT SYSTEMS (IDSS) FOR CLINICAL PRACTICE AND RESEARCH

4.1 Introduction

The second hypothesis of this thesis is proven in this chapter, as an Intelligent Decision Support System (IDSS) is defined for clinical practice and research, including a data management component to exploit the extended CRISP-DM methodology. The following diagram illustrates the framework proposed for the IDSS to be used in the medical domain. The five zones of interest reflect the organisation of the literature review section 2.2.

[Figure: proposed framework for the IDSS. Labels in the figure: ETL Frameworks; Data Source (three instances); Extract Transform Load (ETL) Process; Data Warehouse Architectures; IDSS Data Management; Data Warehouse (DW) / Mart; Domain Ontology; Data Mining / Knowledge Discovery in Data Frameworks; Knowledge Base; Knowledge Base Architecture; Model Base; Model Base Architecture]

4.2 Extraction-Transformation and Loading (ETL) Frameworks

The processes depicted in this framework for ETL are provided by commercially available software tools and supporting customised enterprise processes. As stated in section 2.3.2.1, formal foundations do not exist for conceptual representation of ETL activities. In the medical domain the data sources range from third-party vendor applications such as General Electric ViewPoint to ad hoc, disposable spreadsheets. This IDSS framework proposal suggests using a commercial package for the ETL software.
Functionality required for the ETL software includes (ETL Portal, 2006):

o Graphical user interface
o Formula parser and condition handling for use with a wide variety of data types including ASCII, numeric, date fields, logical/binary
o Smart type casting
o Debugger with trace and breakpoint capability
o Explicit rollback and commit transaction handling
o Field mapping interface
o Scheduler to manage ongoing ETL without end user initiation

This proposed IDSS framework recommends development of data extraction scripts for use with the more recent OLTP database systems currently in use across all medical units. These OLTP databases contain physiological and demographic datasets. In addition, treatment and procedure details are held for every occasion of service for the registered patients. The structure of these OLTP databases is relatively stable and data continues to accumulate on a day-to-day basis. Development of custom ETL scripts, using commercial data management tools, is recommended as there is a reasonable rate of return on resources expended in development due to the life expectancy of such heavily used applications.

Prior to the execution of the ETL scripts it is necessary for the clinicians to review the data and ‘clean it up’. Specifically, clinicians must identify missing data elements, correct data values that are clearly invalid and ensure that consistent units of measurement are used for the domain of each attribute. This aspect of the ETL framework is the most time consuming, as indicated by earlier, acknowledged research (Han et al., 1997; Inmon, 2002).

Improved day-to-day procedures are also required to ensure that new patient data is added to the OLTP databases correctly. The importance of sound data recording must be emphasised to the clinicians. A detailed project management plan is recommended for use by clinicians involved in this data preparation phase.
This is recommended to assist the clinicians because they are required to use many different means to validate data or collect missing data. For example:

o midwives/clinicians may need to refer to the NSW Midwives database (or equivalent) to find information regarding pregnancy outcomes
o it may be necessary to telephone the patients to find other missing aspects of data concerning their treatment by the fetal-maternal unit
o other hospital systems may need to be queried to recover patient demographic details or prior medical history

These data cleanup and pre-processing tasks are most likely to be undertaken by clinicians who are also engaged in day-to-day treatment of patients; therefore careful planning and monitoring of DW-related tasks is essential. It is not sufficient for administrative staff or computing staff to take on the data pre-processing activities, as detailed domain knowledge is required. Very close collaboration between clinicians and IT/computing staff is required for the activities undertaken in the ETL frameworks component of the IDSS.

4.3 Data Warehousing Architectures

The data warehouse architecture needed for the IDSS requires little in the way of summarization, and as far as possible atomic, original data values should be used. Data is recorded and stored in OLTP databases at a fine level of granularity and this needs to be preserved in the data warehouse to assist in clinical research and practice.

The data is high dimensional and, for any given clinical practice analysis or clinical research, a variety of on-the-fly calculations may be required. These requirements are difficult to predict and thus using fine grain data in the data warehouse provides the greatest flexibility for future clinical analysis. As described in the literature review of this thesis, the data used in the medical domain is complex, heterogeneous, often contains multiple-domain hierarchies and is time-varying.
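The clean-up checks that section 4.2 asks clinicians to perform before atomic values are loaded into the data warehouse (missing data elements, clearly invalid values, inconsistent units) can be partly supported by a simple screening step. The following is a minimal sketch, assuming a flat list of measurement records; the record layout, attribute names, units and ranges are hypothetical placeholders and are not clinical reference values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Measurement:
    patient_id: str
    attribute: str
    value: Optional[float]
    unit: Optional[str]


# attribute -> (expected unit, lowest plausible value, highest plausible value)
# Placeholder entries for illustration only, not clinical reference ranges.
EXPECTED: Dict[str, Tuple[str, float, float]] = {
    "maternal_age": ("years", 10.0, 60.0),
    "birth_weight": ("g", 200.0, 7000.0),
}


def find_issues(rows: List[Measurement]) -> List[Tuple[str, str, str]]:
    """List (patient_id, attribute, problem) for records needing clinician review."""
    issues: List[Tuple[str, str, str]] = []
    for row in rows:
        if row.attribute not in EXPECTED:
            continue  # no check defined for this attribute
        unit, low, high = EXPECTED[row.attribute]
        if row.value is None:
            issues.append((row.patient_id, row.attribute, "missing value"))
        elif row.unit != unit:
            issues.append((row.patient_id, row.attribute,
                           f"unit {row.unit!r}, expected {unit!r}"))
        elif not low <= row.value <= high:
            issues.append((row.patient_id, row.attribute,
                           "value outside plausible range"))
    return issues
```

The function only flags records and never alters them, so the original fine-grained values are preserved and the decision on how to correct each one stays with the clinicians who hold the domain knowledge.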
4.3.1 Core Dimensions for IDSS
This section lists some sample core attributes that should be included in a medical IDSS. When clinicians choose to focus on particular research areas this core set of data warehouse attributes will be extended as necessary. Generic datatypes are provided; however, these are dependent on each instantiation of the IDSS and may need to be varied. Codes used to indicate datatype are:
Code    DataType
EC      Encoded data, e.g. 1 for forceps, 2 for suction
T       Text
BLOB    Binary Large Object
N       Numeric
B       Boolean
D       Date
DT      Time
X       As required by IDSS instantiation
Table 4.1: Datatype Coding
Some tailoring for the Australian context has been included, e.g. the Medicare number plus the line number on the Medicare card forms a unique identifier held by all eligible Australians. This is equivalent to the Social Security Number in the United States. The insurance details apply only to some patients, as private health insurance is optional in Australia.
Moving from consideration of a generic medical IDSS to instantiation in an Australian fetal-maternal domain requires consideration of the minimum perinatal dataset in the Australian National Health Data Dictionary (ANHDD). At the time of writing the minimum perinatal national dataset does not include most of the terms required in a fetal-maternal IDSS data warehouse. Application of the generic frameworks exposes particular challenges in a fetal-maternal domain, including:
• Mothers can present on multiple occasions to the fetal-maternal unit; therefore a maternal patient identifier must be combined with an episode identifier to uniquely identify each patient pregnancy. Use of surrogate keys is recommended alongside the natural keys in the data warehouse of the fetal-maternal IDSS.
• Frequently mothers present to the fetal-maternal unit carrying multiple fetuses. Each of the fetuses must be monitored during the initial and subsequent visits.
This presents a challenge from a data management point of view because if two or more fetuses have the same sex it is difficult to distinguish between them at some stages during pregnancy. Each fetus is identified within the data warehouse using a concatenated key containing elements for Maternal Patient Id, Episode Id and Fetus Id. Clinicians make their best efforts to distinguish between the unborn fetuses but sometimes physiological measurements are transposed or otherwise inadvertently associated with an incorrect fetus. Ideally these errors would be detected and corrected in the OLTP databases prior to addition to the IDSS data warehouse.
To further illustrate the implementation challenges encountered when the generic frameworks are applied in the fetal-maternal domain, refer to Appendix A. Appendix A contains a sample set of attributes to convey the complex, time-varying, multi-dimensional nature of the fetal-maternal data. It is estimated that only approximately 20% of the attributes commonly found in the databases that support fetal-maternal units are listed in Appendix A.
4.3.2 Domain Ontology
The construction and use of ontologies leads to tighter definitions of agreed semantics, which is essential when using domain knowledge for Decision Support and Data Mining purposes. The literature review suggests a number of approaches to the creation of domain ontologies. The generic OLTP databases should have technical documentation including the entity-relationship (ER) diagrams that describe the entities, attributes, primary keys and relationships utilized with the DBMS. The ER diagrams may be conceptual or logical; either can be put to good use in ontology development. As Hayes et al. (2005) highlight, concept map construction is a proven method for explicating and communicating domain knowledge. The proposed generic IDSS framework includes a domain ontology to assist in comprehending complex data.
As can be seen in the previous section and Appendix A, the data supporting the mother and fetus is complex, heterogeneous and time-varying. The approach recommended for fetal-maternal IDSS ontology development is a synthesis of two approaches: (1) use of concept maps and (2) use of ER diagrams for generation of the fetal-maternal ontology.
4.4 Data Mining / Knowledge Discovery in Data Frameworks
The KDD framework must include, as a minimum, data mining tools suitable for the generation of knowledge rules. Rule-set generation is essential for implementation of the CRISP-DM extensions for clinical research as described in thesis sections 3.4 and 3.5. The KDD framework should ideally also have a user-friendly software solution to assist clinicians to conduct exploratory data mining across patient data, as proposed and discussed in thesis section 3.4.
Application of the generic KDD framework, as described above, to the fetal-maternal domain requires the outputs from the KDD framework to take the form of clinical evidence reporting. The generation of such clinical evidence is of optimum value if a solid scientific discovery process underlies the reporting and the associated fetal-maternal KDD framework. This research recognises the importance of (1) KDD frameworks and (2) scientific process, both as applicable in a generic manner and specifically within the fetal-maternal domain.
4.5 Knowledge Base Architectures
The proposed framework for the IDSS recommends rule representation of knowledge, as this format is particularly applicable where it is necessary to recommend a course of action based on observable events, such as a patient presenting with particular symptoms and clinicians requiring a treatment protocol. This also melds smoothly with the proposed extended CRISP-DM methodology, detailed in Chapter 3, which recommends rule-set generation in the extension layers resulting from data mining activities.
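As an illustration of what rule representation might look like in software, the sketch below models each rule as an IF-condition paired with a recommended action and evaluates a rule-set against one observation. It is only one possible encoding under stated assumptions: the `Rule` class, the observation keys (`fetus_count`, `procedure`, `gestation_weeks`) and the action labels `PROTOCOL_A` and `PROTOCOL_B` are hypothetical placeholders and carry no clinical meaning.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """IF condition(observation) THEN action. Content is placeholder only."""

    name: str
    condition: Callable[[Dict[str, Any]], bool]
    action: str


# Placeholder rules for illustration; they are not clinical guidance.
RULES: List[Rule] = [
    Rule(
        name="R1",
        condition=lambda obs: obs.get("fetus_count", 1) > 1,
        action="PROTOCOL_A",
    ),
    Rule(
        name="R2",
        condition=lambda obs: obs.get("procedure") == "CVS"
        and obs.get("gestation_weeks", 0) < 11,
        action="PROTOCOL_B",
    ),
]


def recommend(rules: List[Rule], observation: Dict[str, Any]) -> List[str]:
    """Return the action of every rule whose condition holds, in rule order."""
    return [rule.action for rule in rules if rule.condition(observation)]


print(recommend(RULES, {"fetus_count": 2, "procedure": "CVS", "gestation_weeks": 12}))
```

Keeping each rule as a named, self-contained record is what would make the knowledge base documentable and auditable: a rule can be traced back to the rule-set and dataset that produced it.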
The adoption of the extended CRISP-DM methodology and resultant rule representation also helps address the open research issue raised in Literature Review section 2.5.1 by University of New South Wales Health Informatics researchers, namely the need for a documentable, auditable methodology for the establishment of knowledge bases for use with inference engines.
4.6 Model Base Architectures
The model base architectures must contain analytical models specifically developed for each problem domain. It is insufficient to have only generic strategic, tactical, operational and analytical models created for the broad business market. As an example, fetal-maternal models must be developed in close cooperation with fetal-maternal experts/clinicians. These models must be based on sound evidence-based medicine practices. Specifying the models to be developed is beyond the scope of this thesis; however, models of interest would include those:
o to predict maternal and/or fetal response to a particular treatment protocol, e.g. intrauterine fetal transfusion of blood or platelets
o to predict likely pregnancy outcome following chorionic villus sampling or other invasive procedure at variable weeks gestation
o to anticipate the likelihood and type of chromosomal abnormality given maternal, paternal and fetal factors.
CHAPTER 5 ‘DATABABES’ CASE STUDY
5.1 Introduction
The purpose of this chapter is to demonstrate the framework introduced in Chapter 4 via a real-world fetal-maternal case study. The goal is to demonstrate the IDSS framework for clinical practice and research, including a data management component to exploit the extended CRISP-DM. Fetal-maternal Medicine Units worldwide provide care and treatment to mothers and their unborn children. The Fetal-maternal Medicine Unit (FMMU) at Liverpool Hospital, Sydney, is part of the Sydney South West Area Health Service (SSWAHS) (Sydney South West Area Health Service, 2006).
SSWAHS is a New South Wales Government funded health service for 1.3 million people. The Liverpool Hospital has been operating continuously since the end of the eighteenth century. Patient services offered by the Liverpool Hospital FMMU include:
First Trimester
o Ultrasound
o First trimester ultrasound
o First trimester screening: Nuchal Translucency (60%) with PAPP-A / free Beta HCG (90%)
o First trimester multiple pregnancy chorionicity assessment
o Karyotyping Procedure
o Chorionic villus sampling
Second Trimester
o Ultrasound
o Transvaginal cervical assessment
o Monochorionic twin monitoring for twin to twin transfusion
o Maternal uterine artery Doppler
o Maternal antibody ultrasounds and invasive monitoring of fetus
o Invasive Procedures
o Fetal blood sampling
o Amniocentesis
o Chorionic Villus sampling
Third Trimester
o Growth ultrasound / placental localisation
o Amniotic fluid index
o Biophysical profile
o Fetal Therapy
o Intrauterine fetal transfusion of blood or platelets
o Pigtail catheter insertion and drainage of fetal fluid
o Amniodrainage
Vast volumes of electronic data have been generated during the treatment of patients at the Liverpool Hospital FMMU. In total, 24,000 patient records exist in two separate online transaction processing (OLTP) systems. These patient records include a large number of physiological measurements and clinical data for both the pregnant women and their fetus(es) throughout the pregnancy. Initially the FMMU utilized a DOS based Fetal Database and then moved to the General Electric Windows based ViewPoint™ Fetal Database. These applications are designed to assist in the day-to-day operation of fetal-maternal and similar units. Such transaction processing information systems are optimised for data entry rather than data analysis operations, as described by Inmon (2002) and other researchers (Devlin, 1997; Kimball, 1996; Marakas, 2002b).
Fetal-maternal clinicians at the Liverpool FMMU entered into discussions with me to investigate an improved approach to using the existing patient data for clinical research purposes. The disparate datasets were well suited to data warehousing and subsequent exploitation via a fetal-maternal IDSS.
5.2 Aim
The case study aims to test whether data mining and supporting technology components can provide Fetal-maternal Medicine (FMM) clinicians with an improved environment for patient data analysis. Clinicians within the FMMU held hypotheses regarding the correlations between multiple factors within the OLTP system data, for example the relationship between pregnancy outcome and the gauge and type of needle used for transabdominal Chorionic Villus Sampling (CVS). The test would be considered successful if the outcomes were judged to improve the effectiveness of the clinicians’ research analysis. An additional aim was the generation of domain knowledge using rule-generating algorithms.
5.3 Methods
Fetal-maternal clinicians did not have the knowledge or experience to effectively consolidate the existing disparate patient data into a cohesive, well-defined, clean data set ready for clinical analysis and research purposes. Numerous piecemeal attempts had been made using spreadsheets and manual paper-based methods. There was a clear need for custom data cleaning processes, a fetal-maternal data warehouse, a standard statistical component, a KDD/DM component and, importantly, a reporting component. During the Build phase of the constructivist research method it became clear that, due to poor quality data (described further below), it would be necessary to re-define the requirements and limit the scope to Chorionic Villus Sampling (CVS) data only. This gave rise to a CVS data mart rather than a fully implemented data warehouse covering the scope of all maternal and fetal data.
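The kind of custom data cleaning process called for here can be pictured with the short sketch below, which applies three checks to a single record: missing data elements, clearly invalid values and inconsistent units. It is an assumption-laden illustration, not the DataBabes implementation: the field names, the required-field list and the plausible range for gestation are hypothetical, and a real check list would come from the clinicians.

```python
from typing import Any, Dict, List, Tuple

LB_TO_KG = 0.45359237  # exact definition of the international pound in kilograms

# Hypothetical field names and bounds, chosen only for illustration.
REQUIRED_FIELDS = ("patient_id", "episode_id", "gestation_weeks")
GESTATION_WEEKS_RANGE = (0, 45)


def clean_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return a unit-normalised copy of the record and a list of problems found."""
    cleaned = dict(record)
    problems: List[str] = []

    # 1. Missing data elements.
    for field in REQUIRED_FIELDS:
        if cleaned.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # 2. Clearly invalid values.
    weeks = cleaned.get("gestation_weeks")
    low, high = GESTATION_WEEKS_RANGE
    if isinstance(weeks, (int, float)) and not low <= weeks <= high:
        problems.append(f"gestation_weeks out of range: {weeks}")

    # 3. Consistent units: store maternal weight in kilograms.
    if cleaned.get("weight_unit") == "lb" and cleaned.get("weight") is not None:
        cleaned["weight"] = round(cleaned["weight"] * LB_TO_KG, 1)
        cleaned["weight_unit"] = "kg"

    return cleaned, problems


example = {
    "patient_id": 7,
    "episode_id": None,
    "gestation_weeks": 60,
    "weight": 150,
    "weight_unit": "lb",
}
cleaned_example, problems_example = clean_record(example)
print(cleaned_example)
print(problems_example)
```

The function only reports problems; it does not guess replacement values, since recovering a missing or invalid value requires the domain knowledge and external sources available to the clinicians.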
The DataBabes architecture includes the fundamental Knowledge Discovery in Data (KDD) architectural components required to facilitate the CRISP-DM phases. Figure 5.1 illustrates disparate data sources, custom extraction, transformation and loading components, a data mart and a proprietary data mining software tool generating reports that present various aspects of clinical evidence and discovered knowledge. Figure 5.1 also illustrates the use of a Chorionic Villus Sampling Domain Ontology. This ontology is populated by the data definitions and relationships suitable for use with the Chorionic Villus Sampling data mart.
[Figure 5.1: DataBabes architecture. Labels: Extraction Transformation Loading Components (ViewPoint Database, DOS legacy Fetal Database, Extract Transform Load (ETL) Process); Data Warehouse Components (Chorionic Villus Sampling Data Mart); Data Mining / Knowledge Discovery in Data (KDD) Components (Data Mining / KDD).]
The fundamental KDD architecture components as described by Han and Kamber (2001), Mallach (2000) and Inmon (2002) have been customised for the DataBabes research project. The FMMU environment includes disparate legacy and current production OLTP systems, including an archived DOS based mother/fetus database containing clinical and pathological markers. A similar, but not identical, dataset is found in the current production OLTP relational database, GE ViewPoint™. A star schema, as described by Inmon (2002), has been used for CVS data mart construction. CVS procedures populate the fact table; patient details and procedure details are dimension tables in the star schema, as illustrated in Figure 5.2.
[Figure 5.2: Star schema for the CVS data mart.]
5.4 Results
Results of the case study can be viewed in two ways: (1) quantitative results from data mining activities and (2) qualitative outcomes from undertaking the data warehousing project. Results for (1) are disappointing, with many missing data values and incomplete patient records hindering the creation of worthwhile data mining outcomes. However, this should not be seen as a wasted opportunity because, through the conduct of the CVS analysis and data mining, the fetal-maternal clinicians became far more aware of the importance of accurate data recording for strategic research purposes. Local data collection procedures were improved and data accuracy was raised at fetal-maternal unit staff meetings. In addition, a part-time research clinician position was established to focus on improving the existing datasets by recording missing data values with reference to data sources beyond the confines of the fetal-maternal unit, such as the NSW Midwives database. Thus, the qualitative outcomes (2) were encouraging and a further opportunity to conduct CVS data mining will arise following data improvement.
5.5 Conclusions
The lead researchers at the 2004 Pacific Asia Knowledge Discovery in Data (PAKDD) conference were correct to draw new researchers’ attention to the challenge of ‘real world’ data sets. This case study exposed many of the issues the leading researchers highlighted, particularly the poor quality of real world data and the complex nature of medical domain knowledge. However, given the positive response from the Director and clinicians at the Liverpool fetal-maternal unit, the resulting improvements in data quality inspire future research. The paradigm shift towards improving data capture during clinical sessions offers the potential to significantly improve the quality of the clinical research conducted in the fetal-maternal unit.
However, this case study demonstrates that there are currently several factors impacting its mainstream adoption, including data accuracy issues and the reluctance of senior hospital management to embrace exploratory data mining across patient data sets. Attempts to gain follow-on ethics approvals, after the granting of the initial approval, were unsuccessful. This was largely due to the concerns some senior hospital management had regarding the potential exposure of unexpected relationships across the patient data attributes. It should be noted that many clinicians and senior management were very interested in this case study and gave their full support to the ongoing research into IDSS in the medical domain. This attitude was not shared by all.
CHAPTER 6 CONCLUSION
6.1 Contribution to Knowledge
This work began with identification of open research areas including:
1. Existing investigative methods used when data mining across patient medical data are inadequate for the demands of clinical practice and research. The null hypothesis driven medical research paradigm must inform data mining investigative methods in the medical domain.
2. In the medical domain improvement is required in the elicitation of domain knowledge for use within knowledge bases in IDSS.
3. The exploitation of IDSS in the medical domain, particularly in the Australian context, has been slow and clinicians have concerns regarding the content of knowledge bases found in IDSS.
4. Consideration of the feasibility of extending the CRISP-DM to enable its use in medical research driven by the null hypothesis paradigm.
This thesis addressed these open research issues by making the following contributions to knowledge:
1. Existing investigative methods used when data mining across patient medical data are inadequate for the demands of clinical practice and research. The null hypothesis driven medical research paradigm must inform data mining investigative methods in the medical domain.
This research defined extensions to the CRISP-DM to facilitate its use in clinical practice and medical research applications. In addition, this research exposed the parallelism between CRISP-DM and the Scientific Method and the importance of the role played by both exploratory and confirmatory data mining.
2. In the medical domain improvement is required in the elicitation of domain knowledge for use within knowledge bases in IDSS. My extended CRISP-DM offers a largely automated approach to data mining that generates electronic rule-sets based on clinical evidence captured and stored in electronic OLTP systems. This rule representation of knowledge is then suitable for use in knowledge bases in IDSS.
3. The exploitation of IDSS in the medical domain, particularly in the Australian context, has been slow and clinicians have concerns regarding the content of knowledge bases found in IDSS. The framework for IDSS proposed in this research has been developed in collaboration with an Australian fetal-maternal medicine unit. The evidence-based rule-sets are thus generated from the local datasets. The extended CRISP-DM offers an approach using an investigative method akin to the null hypothesis driven scientific method, thus providing confidence regarding rigour and statistical significance.
4. Consideration of the feasibility of extending the CRISP-DM to enable its use in medical research driven by the null hypothesis paradigm. The extended CRISP-DM presented in this research demonstrates the manner in which this generic approach can be extended to enable its use in important medical research driven by the null hypothesis paradigm.
6.2 Future Research
Specific future research in the fetal-maternal domain includes the development of a domain ontology. The details of sample attributes described in the Chapter 5 case study represent approximately 15% of the factors in this domain.
The volume and complexity of factor relationships is vast, yet the benefits of a fetal-maternal ontology would be immediately appreciated in the IDSS context. Similarly, the capture of domain knowledge in a machine-readable format would also benefit the fetal-maternal community through provision of a knowledge base for IDSS purposes. The development of methodologies and tools to capture this domain knowledge has scope well beyond the fetal-maternal area into the broader health environment.
Future research regarding the extended CRISP-DM could investigate the number of significant rule sets generated versus spurious or meaningless rule sets for a given medical dataset. Developing metrics to measure the validity or impact of unexpected knowledge exposed during the exploratory data mining phase of the extended CRISP-DM is also an interesting area for future research.
Finally, the development of an application to support clinicians as they work through the additional, medical-specific layers in CRISP-DM when used for clinical research would be a valuable, marketable future direction for research. This would be a change from most of the field’s earlier work, which directed efforts to building tools for data mining ‘experts’ rather than domain-knowledgeable end-users.
6.3 Conclusion
A conclusion that can be drawn from this research is that there are strong parallels between the widely accepted CRISP-DM and the long established scientific method, as illustrated in Figure 3.3. Exploratory data mining and confirmatory data mining play an important part in the extended CRISP-DM methodology. The additional layers proposed within the CRISP-DM for medical research have been shown to accommodate the demands of the null hypothesis paradigm. The feasibility of extending CRISP-DM to enable its use in medical research driven by the null hypothesis paradigm has been demonstrated.
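One way to picture the confirmatory step is the sketch below, which treats a mined rule as a 2x2 contingency table (rule antecedent present or absent against outcome present or absent) and tests the null hypothesis of no association with Pearson's chi-square test. The choice of test, the function names and the counts are assumptions for illustration only; this section does not prescribe a particular statistic, and tables with small cell counts would call for an exact test instead.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its p-value for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


def reject_null(a: int, b: int, c: int, d: int, alpha: float = 0.05) -> bool:
    """True when the null hypothesis of no association is rejected at level alpha."""
    _, p_value = chi_square_2x2(a, b, c, d)
    return p_value < alpha


# Invented counts: rows are rule antecedent present/absent,
# columns are outcome present/absent.
statistic, p_value = chi_square_2x2(30, 10, 20, 40)
print(f"chi-square = {statistic:.3f}, p = {p_value:.6f}")
```

Under this reading, only rules for which `reject_null` returns true would proceed to the final layer and be loaded into the knowledge base; when many rules are tested at once, the significance level would need adjusting for multiple comparisons.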
The layers (Data Mining Rule-Set Generation, Selecting Significant Rule-Sets, Formulating the Null Hypothesis, Running Statistical Processes to Test the Null Hypothesis and, finally, Loading Accepted Rule-Sets into the associated IDSS) have been defined within this thesis. The establishment of the extended CRISP-DM, utilizing the additional proposed layers, enables the methodology to be applied to medical research in accordance with Research Hypothesis 1.
The framework for IDSS to support the extended data mining methodology has been defined, as described below. The importance of the domain ontology and knowledge base has been presented, specifically for the fetal-maternal domain. The extended CRISP-DM methodology operates within the highlighted Data Management component. The rule-sets generated by the data management component ‘feed’ into the rule-based knowledge representation. The importance of the integration between the extended CRISP-DM methodology and the supply of domain knowledge to meet the requirements of the proposed IDSS framework has been presented. The resultant IDSS, with a data management component capable of exploiting the extended CRISP-DM methodology as per Research Hypothesis 1, addresses the demands of Research Hypothesis 2.
Research Hypothesis 1: The Cross Industry Standard Process for Data Mining (CRISP-DM) can be extended to enable its use in medical research driven by the null hypothesis paradigm
and
Research Hypothesis 2: An Intelligent Decision Support System (IDSS) can be defined for clinical practice and research including a data management component to exploit the extended CRISP-DM methodology.
Both have been addressed and shown to provide an integrated research outcome, specifically an extended CRISP-DM for use with null hypothesis driven research and a supporting IDSS framework for clinical practice and research. Therefore, these hypotheses have been proven.
REFERENCES
Alavi, M., & Carlson, P. (1992).
A review of MIS Research and Disciplinary Development. Journal of Management Information Systems, 8(4), 45-62. Alexandrini, F., Krechel, D., Maximini, K., & von Wangenheim, A. (2003). Integrating CBR into the health care organization. Paper presented at the 16th IEEE Symposium on Computer-Based Medical Systems, New York, New York, USA. Amardeilh, F., Laublet, P., & Minel, J. (2005). Documentation Annotation and Ontology Population from Linguistic Extractions. Paper presented at the KCAP '05, Banff, Alberta, Canada. Anderson, G. Lecture 1: Scientific Method. Retrieved February 8, 2006, from http://pasadena.wr.usgs.gov/office/ganderson/es10/lectures/lecture01/lecture01 .html Anthony, R.N. (1965). Planning and Control Systems: A Framework for Analysis. Cambridge, MA., Harvard University Graduate School of Business Management. Australian Health Information Council. (2003). Electronic Decision Support for Australia's Health Sector . Retrieved 11 May 2006, from www.ahic.org.au Australian Health Information Council. (2003) Electronic Decision Support Evaluation Methodology. Retrieved 11 May 2006, from http://www.ahic.org.au/evaluation/guidelines.htm Avison, D. (2002). Action Research: A Research Approach for Cooperative Work. Paper presented at the 7th International Conference on Computer Supported Cooperative Work in Design, Rio de Janeiro, Brazil. Avison, D., Lau, F., & Myers, MD. (1999). Action Research. Communications of the ACM, 42(1), 94-97. Babcock, B., Babu, S., Datar, M., Motwani, R. & Widom, J. (2002). Models and Issues in Data Stream Systems. Paper presented at the 21st ACM SIGMODSIGART Symposium on Principles of Database Systems, Madison, Wisconsin. Beckett, D. (2004). Scalable RDBMS report. Retrieved 4 June 2004, from www.w3.org/2001/sw/Europe/reports/scalable_rdbms_mapping_report Becquet, C., Blachon, S., Jeudy, B., Boulicaut, JF., Gandrillon, O. (2002). 
Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data. Genome Biology, 3(12). Bench-Capon, T., Coenen, F., Nwana, H., Paton, R., & Shave, M. (1993). Two aspects of the validation and verification of knowledge based systems. IEEE Expert, 8(3), 76-81. Bench-Capon, T., & Visser, P. (1997). Ontologies in Legal Information Systems; The Need for Explicit Specifications of Domain Conceptualisations. Paper presented at the 6th International Conference on AI and Law, Melbourne, Victoria, Australia. Berndt, DJ., Fisher, JW., Hevner, AR., & Studnicki, J. (2001). Healthcare Data Warehousing and Quality Assurance. Computer, December 2001, 56-65. Blackmore, K., & Bossomaier, T.R.J. (2002). Soft computing methodologies for mining missing person data. In Proceedings of Sixth Australia-Japan Joint Workshop on Intelligent and Evolutionary Systems (AJJWIES 2002), Canberra, ACT, Australia. Bonifati, A., Cattaneo, F., Ceri, S., Fuggetta, A. & Paraboschi, S. (2001). Designing Data Marts for Data Warehouses. ACM Transactions on Software Engineering and Methodology, 10(4), 452-483. Boose, J. (1985). A Knowledge Acquisition Program for Expert Systems Based on Personal Construct Psychology. International Journal of Man Machine Studies, 23, 495-525. Boucelma, O., Castano, S., & Goble, C. (2002). Report on the EDBT'02 Panel on Scientific Data Integration. SIGMOD Record, 31(4). Brossette, S., Sprague, A., Hardin, M., Waites, K., Jones, W., & Moser, S. (1998). Association rules and data mining in hospital infection control and public health domain. Journal of American Medical Informatics Association, 5(4), 373-381. Cabena, P., Hadjinian, P., Stadler, R., Verhees, J. & Zanasi, A. (1997). Discovering Data Mining from Concept to Implementation. New Jersey, USA: Prentice-Hall PTR. Carey, S. (1994). A Guide to the Scientific Method. California, USA: Wadsworth Publishing Company. Cesnik, B. (2002).
Report of the Electronic Decision Support Governance Workshop. Retrieved 11 May 2006, from www.ahic.org.au Connolly, T., & Begg, C. (2005). Database Systems: A Practical Approach to Design, Implementation and Management (4th ed.). England: Addison-Wesley. Corsi, P., Weindling, P. (Eds.) (1983) Information Sources in the History of Science and Medicine. London: Butterworths. CRISP-DM. (2004). Retrieved 1 December 2004, from www.crisp-dm.org Devlin, B. (1997). Data Warehouse- from Architecture to Implementation. Reading Mass: Addison Wesley. DxPlain (2006) Lab of Computer Science, Massachusetts General Hospital. Retrieved 3 February 2006, from http://www.lcs.mgh.harvard.edu/projects/dxplain.html ETL Portal. (2006). DM Review Retrieved 1 November 2006, from http://www.dmreview.com/portals/portal.cfm?topicId=230206 Ewen, E., Medsker, C., Dusterhoft, L., Levan-Schultz, K., Smith, J., & Gottschall, M. (1999). Data Warehousing in an Integrated Health System; Building the Business Case. ACM, 47-53. Gahleitner, E., Behrendt, W., Palkoska, J., Weippi, E. (2005). On Cooperatively Creating Dynamic Ontologies. Paper presented at the ACM HT'05, Salzburg, Austria. Galliers, R.D. (1993). Research Issues in information systems. Journal of Information Technology, 8, 92-98. Golfarelli, M., Rizzi, S., & Vrdoljak, B. (2001). Data warehouse design from XML sources. Paper presented at the 4th ACM International Workshop on Data Warehousing and OLAP, Atlanta, Georgia, USA. Gomez-Perez, A. (2004). Retrieved 4 June 2004, from ontoweb.aifb.unikarlsruhe.de/Members/ruben/Deliverable%201.5 Goodwin, L., & Grzymala-Busse, J. (2001). Data Mining Approaches for Perinatal Knowledge Building. Handbook of Data Mining and Knowledge Discovery. New York: Oxford University Press. Goodwin, L., Iannacchione, A., Hammond, W., Crockett, P., Mahler, S., & Schlitz, K. (2001). Data Mining Methods Find Demographic Predictors of Preterm Birth. Nursing Research, 50(6), 340 - 345. 
Goodwin, L., Maher, S., Ohno-Machado, L., Iannacchione, M., Crockett, P., Dreiseitl, S., Vinterbo, S., & Hammond, W. (2000). Building Knowledge in a Complex Preterm Birth Problem Domain. Paper presented at the AMIA Annual Fall Symposium, Philadelphia. Gorry, G.A., & Scott Morton, M. (1971). A Framework for Management Information Systems. Sloan Management Review, 13, 55-70. Graziano, A., & Raulin, M. (2003). Research Methods a Process of Inquiry (5th ed.). Boston: Pearson Education. Gross-Portney, L., & Watkins, M. (2000). Foundations of Clinical Research, applications to practice. (7th ed.). New Jersey: Prentice Hall Health. Gruber, TR. (1992). ONTOLINGUA: A Mechanism to Support Portable Ontologies, technical report: Knowledge Systems Laboratory, Stanford University, California, USA. Hagland, M. (2004). Health Care Informatics Online Data Mining. Health Care Informatics. Retrieved 1 April 2004 from http://www.healthcareinformatics.com/ Han, J. (1995). Mining Knowledge at Multiple Concept Levels. Paper presented at the 4th International Conference on Information and Knowledge Management. Baltimore, Maryland, USA. Han, J. (1996). Data mining techniques. ACM SIGMOD Record, Proceedings of 1996 ACM SIGMOD international conference on management of data SIGMOD'96, 25(2), Montreal, Quebec, Canada. Han, J. (1998). Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27(1). Han, J. (2002). Evolving data mining into solutions for insights: Emerging scientific applications in data mining. Communications of the ACM, 45(8), 54-58. Han, J., Chiang, J., Chee, S., Chen, J., Chen, Q., Cheng, S., Gong, W., Kamber, M., Koperski, K., Liu, G., Lu, Y., Stefanovic, N., Winstone, L., Xia, B., Zaine, O., Zhang, S., & Zhu, H. (1997). DBMiner: a system for data mining in relational databases and data warehouses. Paper presented at the 1997 Conference of the Centre for Advanced Studies on Collaborative research. Toronto, Canada. Han, J., & Kamber, M. (2001).
Data Mining Concepts and Techniques (1st ed.). San Francisco: Morgan Kaufmann Publishers. Han, J., & Pei, J. (2000). Mining frequent patterns by pattern-growth: methodology and implications. ACM SIGKDD Explorations Newsletter, 2(2), 14-20. Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. ACM SIGMOD Record, SIGMOD'00, 29(2), 1-12. Hayes, P., Reichherzer, T., & Mehrotra, M. (2005). Collaborative Knowledge Capture in Ontologies. Paper presented at the K-CAP '05, Banff, Alberta, Canada. Heath, J., Heath, S., McGregor, C., & Smoleniec, J. (2004). DataBabes: A Case Study in Data Warehousing and Mining Perinatal Data. Paper presented at CASEMIX, Sydney, Australia. Heath, J., & McGregor, C. (2004). Research Issues in Intelligent Decision Support. Paper presented at the UWS College of Science, Technology and Environment, Innovation Conference, Sydney, Australia. Heath, J., McGregor, C., & Smoleniec, J. (2005). DataBabes: A Case Study in Feto-Maternal Clinical Data Mining. Paper presented at the Health Informatics Conference of Australia, Melbourne. Hummer, W., Bauer, A., & Harde, G. (2003). XCube - XML For Data Warehouses. Paper presented at the 6th ACM International Workshop on Data Warehousing and OLAP, New Orleans, Louisiana, USA. Inmon, W. (2002). Building the Data Warehouse (3rd ed.). New York: Wiley. Ji, W., Naguib, R.N.G., & Ghoneim, M.A. (2003). Neural network-based assessment of prognostic markers and outcome prediction in bilharziasis-associated bladder cancer. Information Technology in Biomedicine, IEEE Transactions on, 7(3), 218-224. Johnson, S.B. (2004). The development of decision support systems to enable plant demographic research in the Australian cotton industry. Australia: Department of Primary Industries. Jung, G., & Gudivada, V. (1995).
Automatic determination and visualisation of relationships among symptoms for building medical knowledge bases. Paper presented at the 1995 ACM Symposium on Applied Computing, Nashville, Tennessee, USA.
Kawamoto, K., Houlihan, C., Balas, E., & Lobach, D. (2005). Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. British Medical Journal, 330(7494).
Kennedy, P. (2004). Extracting and Explaining Biological Knowledge in Microarray Data. Paper presented at the Pacific Asia Knowledge Discovery in Data (PAKDD) 2004, Sydney, Australia.
Kimball, R. (1996). The Data Warehouse Toolkit. New York: Wiley.
Kock, N.F., Avison, D., Baskerville, R., Myers, M., & Wood-Harper, T. (1999). IS Action Research: Can We Serve Two Masters? Paper presented at the 20th International Conference on Information Systems, Charlotte, North Carolina, USA.
Kock, N.F., McQueen, R.J., & Baker, M. (1996). Negotiation In Information Systems Action Research. Paper presented at the Information Systems Conference of New Zealand, Palmerston North, New Zealand.
Kovalerchuk, B., Vityaev, E., & Ruiz, J.F. (2000). Consistent knowledge discovery in medical diagnosis. IEEE Engineering in Medicine and Biology Magazine, 19(4), 26-37.
Lee, S., & Abbott, P. (2003). Bayesian networks for knowledge discovery in large datasets: basics for nurse researchers. Biomed Inform, 36, 389-399.
Little, J.D. (1970). Models and Managers: The Concept of a Decision Calculus. Management Science, 16(8), 466-485.
Lord, S., Genski, V., & Keech, C. (2004). Multiple analyses in clinical trials: sound science or data dredging? Medical Journal of Australia, 181(8), 452-454.
Lyman, J., Boyd, J., & Dalton, J. (2003). Applying the HL7 reference information model to a clinical data warehouse. Paper presented at the IEEE International Conference on Systems, Man and Cybernetics, 2003, Washington DC, USA.
Mackinnon, J., & Glick, N. (1999).
Data Mining and Knowledge Discovery in Databases - An Overview. Australian and New Zealand Journal of Statistics, 41(3), 255-275.
Mallach, E. (2000). Decision Support and Data Warehouse Systems. New York: Irwin McGraw-Hill.
Marakas, G.M. (2002a). Decision Support Systems in the 21st Century. Upper Saddle River, New Jersey: Prentice Hall.
Marakas, G.M. (2002b). Modern Data Warehousing, Mining and Visualisation. Upper Saddle River, New Jersey: Prentice Hall.
Masuda, G., Sakamoto, N., & Yamamoto, R. (2002). A Framework for Dynamic Evidence Based Medicine using Data Mining. Paper presented at the 15th IEEE Symposium on Computer-Based Medical Systems, Maribor, Slovenia.
Masuda, G., & Sakamoto, N. (2002). A framework for dynamic evidence based medicine using data mining. Paper presented at the 15th IEEE Symposium on Computer-Based Medical Systems (CBMS 2002), Maribor, Slovenia.
Mathews, J.R. (1995). Quantification and the Quest for Medical Certainty. New Jersey: Princeton University Press.
Matsumoto, T., Ueda, Y., & Kawaji, S. (2002). A software system for giving clues of medical diagnosis to clinician. Paper presented at the 15th IEEE Symposium on Computer-Based Medical Systems (CBMS 2002), Maribor, Slovenia.
McCarthy, J. (2000). Phenomenal Data Mining: From Data to Phenomena. ACM SIGKDD Explorations Newsletter, 1(2), 24-29.
McGregor, C., Bryan, G., Curry, J., & Tracey, M. (2002). The e-Baby Data Warehouse: A Case Study. Paper presented at the 35th Hawaii International Conference on System Sciences, Hawaii, USA.
Miquel, M., & Tchounikine, A. (2002). Software components integration in medical data warehouses: a proposal. Paper presented at the 15th IEEE Symposium on Computer-Based Medical Systems (CBMS 2002), Maribor, Slovenia.
Ohsaki, M., Sato, Y., Kitaguchi, S., Yokoi, H., & Yamaguchi, T. (2004). Comparison between objective interestingness measures and real human interest in medical data mining.
Paper presented at the 17th International Conference on Innovations in Applied Artificial Intelligence, Ottawa, Canada.
Ohsaki, M., Kitaguchi, S., Yokoi, H., & Yamaguchi, T. (2005). Investigation of Rule Interestingness in Medical Data Mining. Active Mining, Springer(3430), 174-189.
Pedersen, T., & Jensen, C. (1998). Research Issues in Clinical Data Warehousing. Paper presented at the 10th International Conference on Scientific and Statistical Database Management, Capri, Italy.
Piantadosi, P. (1997). Clinical Trials - A Methodologic Perspective (1st ed.). New York: John Wiley & Sons.
Podgorelec, V., Kokol, P., & Stiglic, M. (2002). Searching for new patterns in cardiovascular data. Paper presented at the 15th IEEE Symposium on Computer-Based Medical Systems (CBMS 2002), Maribor, Slovenia.
Popp, R., Armour, T., Senator, T., & Numryk, K. (2004). Countering Terrorism Through Information Technology. Communications of the ACM, 47(3), 36-43.
Povalej, P., Lenic, M., Zorman, M., Kokol, P., Peterson, M., & Lane, J. (2003). Intelligent data analysis of human bone density. Paper presented at the 16th IEEE Symposium on Computer-Based Medical Systems, 2003, New York, New York.
Qiao, L., Agrawal, D., & Abbadi, A. (2003). Supporting Sliding Windows Queries for Continuous Data Streams. Paper presented at the 15th International Conference on Scientific and Statistical Database Management, Cambridge, MA, USA.
Raghupathi, Winiwarter, Werner, & Tan, J. (2002). Strategic IT Applications in Health Care. Communications of the ACM, 45(12), 56-61.
Rao, R., Niculescu, R., Germond, C., & Rao, H. (2003). Clinical and Financial Outcomes Analysis with Existing Hospital Patient Records. Paper presented at the SIGKDD, Washington DC.
Rindfleisch, T. (1997). Privacy, Information Technology and Health Care. Communications of the ACM, 40(8), 93-100.
Robinson, J.B. (2005). Understanding and Applying decision support systems in Australian farming systems research. University of Western Sydney, Sydney.
Roddick, J., Fule, P., & Graco, W. (2003). Exploratory Medical Knowledge Discovery: Experiences and Issues. SIGKDD Explorations Newsletter, 5(1), 94-99.
Roiger, R., & Geatz, M. (2003). Data Mining. England: Addison Wesley.
Sabou, M., Wroe, C., Goble, C., & Mishne, G. (2005). Learning Domain Ontologies for Web Service Descriptions: an experiment in Bioinformatics. Paper presented at the IW3C2, Chiba, Japan.
Sackett, D., Rosenberg, W., Muir Gray, J., Haynes, B., & Scott-Richardson, W. (1996). Evidence based medicine: what it is and what it isn't. British Medical Journal, 312, 71-71.
Schubart, J., & Einbinder, J. (2000). Evaluation of a data warehouse in an academic health sciences center. International Journal of Medical Informatics, 60(3), 319-333.
Simon, H.A. (1960). The New Science of Management Decision. New York: Harper and Collins.
Summons, P., Giles, W., & Gibbon, G. (1999). Decision Support for Fetal Gestation Age Estimation. Paper presented at the 10th Australasian Conference on Information Systems, Wellington, New Zealand.
Susman, G.I., & Evered, R.D. (1978). An Assessment of the Scientific Merits of Action Research. Administrative Science Quarterly, 23, 582-603.
Sydney South West Area Health Service. (2006). Retrieved 27th December 2005, from http://www.sswahs.nsw.gov.au/Service_Facility.aspx
Tsymbal, A., Cunningham, P., Pechenizkiy, M., & Puuronen, S. (2003). Search strategies for ensemble feature selection in medical diagnostics. Paper presented at the 16th IEEE Symposium on Computer-Based Medical Systems, 2003, New York, New York.
Turban, E., & Aronson, J. (2001). Decision Support Systems and Intelligent Systems. Upper Saddle River, NJ: Prentice Hall.
Upadhyaya, S., & Kumar, P. (2005). ERONTO: A Tool for Extracting Ontologies from Extended E/R Diagrams. Paper presented at the SAC'05, Santa Fe, New Mexico, USA.
Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002). Conceptual Modeling for ETL Processes.
Proceedings of 5th ACM International Workshop on data warehousing and OLAP, McLean, VA, USA, 14-21.
Wang, H., Fan, W., Yu, S., & Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. Paper presented at the 9th ACM SIGKDD, Washington DC, USA.
Warren, J., & Stanek, J. (2005). Decision Support Systems. In M. Conrick (Ed.), Health Informatics: Transforming Healthcare with Technology (pp. 252-265). Melbourne: Thomson.
Webb, G., Han, J., & Fayyad, U. (2004). Panel Discussion. Paper presented at the 8th Pacific Asia Knowledge Discovery in Data, Sydney, Australia.
Webb, G.I. (2001, August). Discovering associations with numeric variables. Paper presented at the 7th ACM SIGKDD international conference on knowledge discovery and data mining, Boston, MA, USA.
Webb, G.I., Butler, S., & Newlands, D. (2003). On detecting differences between groups. Paper presented at the 9th ACM SIGKDD international conference on knowledge discovery and data mining, Washington DC, USA.
Webb, G.I. (2000). Efficient search for association rules. Paper presented at the 6th ACM SIGKDD international conference on knowledge discovery and data mining, Boston, MA, USA.
Wong, M.L., Lam, W., Leung, K.S., Ngan, P.S., & Cheng, J.C.Y. (2000). Discovering knowledge from medical databases using evolutionary algorithms. IEEE Engineering in Medicine and Biology Magazine, 19(4), 45-55.
Xintao, W., & Daniel, B. (2002). Learning missing values from summary constraints. ACM SIGKDD Explorations Newsletter, 4(1).
Xu, Z., Cao, X., Dong, Y., & Wenping, S. (2004). Formal Approach and Automated Tool for Translating ER Schemata into OWL Ontologies. Paper presented at the 8th Pacific Asia Knowledge Discovery in Data Conference, 2004, Sydney, Australia.
Yu, C. (2004). A web-based consumer-oriented intelligent decision support system for personalized e-service. Paper presented at the 6th International conference on electronic commerce ICEC '04, Delft, The Netherlands.
Yu, P. (2004). Keynote Address.
Paper presented at the 8th Pacific Asia Knowledge Discovery in Data, Sydney, Australia.
Zaidi, S., Abidi, S., & Manickam, S. (2002). Distributed data mining from heterogeneous healthcare data repositories: towards an intelligent agent-based framework. Paper presented at the 15th IEEE Symposium on Computer-Based Medical Systems (CBMS 2002), Maribor, Slovenia.
Zdanowicz, J. (2004). Detecting Money Laundering and Terrorist Financing with Data Mining. Communications of the ACM, 47(5), 53-55.
Zeleznikow, J., & Nolan, J. (2001). Using Soft Computing to build real world intelligent decision support systems in uncertain domains. Decision Support Systems, 31, 263-285.
Zorman, M., Kokol, P., Lenic, M., Povalej, P., Stiglic, B., & Flisar, D. (2003). Intelligent platform for automatic medical knowledge acquisition: detection and understanding of neural dysfunctions. Paper presented at the 16th IEEE Symposium on Computer-Based Medical Systems, 2003, New York, New York.

Appendix A

Appendix A contains a sample set of attributes to convey the complex, time-varying, multi-dimensional nature of the fetal-maternal data. It is estimated that only approximately 20% of the attributes commonly found in the databases that support fetal-maternal clinical practice are listed in Appendix A.

MaternalPatient
PatientId(X) EpisodeID(X) StreetNumber&Name(T) Suburb(T) State(T) Country(T) Postcode(N)
DateOfBirth(D) Email(T) Ethnic Group(EC) FamilyDrawing(BLOB) HomePhone(N) MobilePhone(N)
WorkPhone(N) MedicareNumber(N) MedicareLineNum(N) Insurance(B) InsuranceProvidor(T)
InsurancePolicyNumber(T) MaidenName(T) Title(T) Surname(T) FirstName(T) MiddleName(T)
Occupation(EC) Religion(EC) HospitalNumber(X)

Fetus
MaternalPatientID(X) EpisodeID(X) FetusID(X) 3Ventricle(N) 4Ventricle(N) 4Chamber(N)
Abdomen(EC) AbdomenDescription(T) AbnormalVenouseReturn(B) Acardia(B) Achondrogenesis(B)
AchondrogenesisType(T) AD1(N) AdditionalBiometry(B) AF_Comment(T) AFDeepestPool(N)
AFDeepestPool$(B) AFIndex(N) AFIndex$(B) AFLeftLowerPool(N) AD2(N) AFLeftUpperPool(N)
AFRightLowerPool(N) AFRightUpperPool(N) AortaDiam(N) AortaStenosis(B) AortaStenosis1(T)
AorticCoarction(B) AorticIsthmusStenosis(B) AorticValveAtresia(B) ArCyst1(N) ArCyst2(N)
ArnoldChiariA(B) ArCyst3(N) Arachnoid(B) ArnoldChiariB(B) ArterialDoppler(B)
ArthrogryposisMultiplex(B) ASD1(B) ASD2(B) AVSeptal(B) Balkenaplasia(B) BladderEntrophy(B)
BodyStalkAnomaly(B) BPD(N) BPDFL(N) BPDOFD(N) BPF1(N) BPF2(N) BPF3(N) BPF4(N) BPF5(N) BPF6(N)
BPFAccelerations(N) BPFAFVolume(N) BPFBodyMovements(N) BPFPlacentalGrading(N)
BPFRespiratoryMovements(N) BPFScore(N) BPFSystem(N) BPFTone(N) Brachycephaly(B) Brain(N)
BrandefOther(T) Brandeftext(T) BrochogenicCysts(B) BrochogenicCystsA(N) BrochogenicCystsDiam(N)
Calcification(B) CAM(B) CAMSite(N) CAMType(N) CamptomelicDysplasia(B) CardDiamAP(N)
CardDiamT(N) CardiaDiamT(N) CardiacTumour(B) CC(N) CCTC(N) CDH(B) CDHSite(T) Chest(N)
Chest$(B) ChestOther(T) ChestText(T) ChestWallA(B) ChestWallB(B) ChestWallC(B) ChestWallD(B)
ChondrodystrophicDystrophy(B) Chrom2(T) Chrom1(T) Chromosomes(EC) CloverLeafShape(B)
CloacalExtrophy(B) CM(N) Coarctation(B) COMCarotidEDF(N) ComCarotidPI(N) CpmCarotidRI(N)
ComCarotidVm(N) ComCarotidVmax(N) Cord(N) CordChoriangioma(B) CordChoriangiomaSize(N)
CordCysts(T)
CordKnot(B) CordRoundNeck(B) CordSingleArtery(B) CordSiteDetails(T) CordDescription(T)
CysticHygromas(B) CysticHygromasA(N) CRL(N) CysticHygromasB(N) Cysts(B)
DawesRedmanCriteria(N) Degree(N) DHContentOther(T) DHContentA(B) DHContentB(B)
DHContentC(B) DHContentD(B) Diagnosis(T) DILDetail(N) Dilation(B) DilatedCM(B) DirectPrep(N)
Dolichocephaly(B) DoubleOutletLV(B) DoubleOutletRV(B) DPrepText(T) DRMinutes(N)
DysDetails(T) Dysrhythmia(B) EarlyPregBiom(B) Ebstein(B) EchocardiographyDesc(T)
EctopiaCordis(B) EFWMethod(N) EllisVanCrefeld(B) EmbryoStructure(N) EncephDetails(T)
Encephalocele(B) EPAbdomen(T) EPBiometryDesc(T) EPBladder(T) EPBrain(T) EPFeet(T)
EPHands(T) EPMalformation(B) EPMalformationDesc(T) EPOther(T) EPSkull(T) EPSpine(T)
EPStomach(T) EstWeight(N) ESTWeightLbs(N) EstWeightOz(N) Exencephaly(B) Exomphalos(B)
ExomphalosBladder(B) ExomphalosBowel(B) ExomphalosHeart(B) ExomphalosLiver(B)
ExomphalosMeas(B) ExomphalosMesentery(B) ExomphalosStomach(B) FemurR(N)
FetalHeartActivity(T) FetalMovements(EC) FibulaL(N) FibulaR(N) FirstTrimesterRisk(B)
FootLeft(N) FootRight(N) Fallot(B)

Fetal Doppler Measurements
MaternalPatientID(X) EpisodeID(X) FetusID(X) FetalDopplerDesc(T)

Fetal Heart Rate Measurements
MaternalPatientID(X) EpisodeID(X) FetalHeartRate(N) FHRAccels(N) FetusID(X)
FHRAccelsHour(N) FHRBaseline(N) FHRCategory(EC) FHRDuration(N) FHRHighVarHour(N)
FHRDecels(N) FHRHighVariationEpisodes(N) FHRLowVarHour(N) FHRLowVariationEpisodes(N)
FHROverallVariation(N) FHRShortTimeVariation(N) FHRSignalLoss(N) FHRStart(DT)

Extremities
MaternalPatientID(X) EpisodeID(X) FetusID(X) Feet(N) Hands(N) Humerus(N) Femur(N) Radius(N)
Ulna(N) Tibia(N) Fibula(N) Joints(N) ExtremitiesNormal(B) ExtremitiesDesc(T)

Face
MaternalPatientID(X) EpisodeID(X) FetusID(X) Eyes(N) Nose(N) Palate(N) Profile(N)
FacialCleft(B) EarsAbnormal(B) EyesAbnormal(B) Macroglossia(B) Micrognathia(B)
NoseAbnormal(B) FacialTumour(B) FaceOther(T)
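
The listing above writes each attribute as a name followed by a short code in parentheses, for example PatientId(X) or DateOfBirth(D). As a minimal sketch, assuming only that notation and nothing about what the codes mean (this excerpt does not define them), the following Python splits such a run into (name, code) pairs and tallies the codes. The function name parse_attributes and the sample string are illustrative and are not part of the thesis.

```python
import re
from collections import Counter

# Matches "Name(CODE)": a name containing no parentheses, then an upper-case code.
# Names may contain spaces or symbols (e.g. "Ethnic Group", "AFDeepestPool$").
ATTRIBUTE = re.compile(r"([^()]+?)\(([A-Z]+)\)")


def parse_attributes(run):
    """Return the (name, code) pairs found in a run of Name(CODE) tokens."""
    return [(name.strip(), code) for name, code in ATTRIBUTE.findall(run)]


# Illustrative sample taken from the MaternalPatient listing. The missing
# space after "(BLOB)" mirrors the extracted text; the pattern tolerates it.
sample = (
    "PatientId(X) EpisodeID(X) DateOfBirth(D) Ethnic Group(EC) "
    "FamilyDrawing(BLOB)HomePhone(N)"
)

pairs = parse_attributes(sample)
code_counts = Counter(code for _, code in pairs)
```

Because the pattern treats everything between two closing parentheses as a name, an entity heading such as Fetus should be removed before its attribute run is passed in; otherwise it would be merged into the first attribute name.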