University of Western Sydney
School of Computing and Mathematics
A Framework for an Intelligent
Decision Support System (IDSS),
Including a Data Mining
Methodology,
for
Fetal-Maternal Clinical Practice
and Research.
By Jennifer Heath
A dissertation submitted in fulfilment of the requirements
of
Master of Science (Hons)
November, 2006
CERTIFICATION
I, Jennifer Heath, certify that this thesis, submitted to address the
requirements for the Award of Master of Science (Honours), in the School of
Computing and Mathematics, University of Western Sydney, is wholly my
own work unless otherwise referenced or acknowledged. This document has
not been submitted at any other Academic institution to meet requirements
for any Award.
16th November 2006
Jennifer Heath
TABLE OF CONTENTS
CERTIFICATION
ACKNOWLEDGEMENTS
PUBLICATIONS RELATED TO DISSERTATION
ABSTRACT
CHAPTER 1 INTRODUCTION
1.1 INTRODUCTION
1.2 RESEARCH AIMS AND OBJECTIVES
1.3 RESEARCH RATIONALE
1.3.1 Knowledge Discovery in Data (KDD) Approaches for Clinical Research
1.3.2 Limitations of Existing OnLine Transaction Processing Systems for the Fetal-Maternal Domain
1.3.3 Slow Uptake of IDSS in the Fetal-Maternal Medical Domain
1.4 RESEARCH APPROACH
1.5 CONTRIBUTIONS TO KNOWLEDGE
1.6 THESIS OVERVIEW
CHAPTER 2 LITERATURE REVIEW
2.1 INTRODUCTION
2.2 KNOWLEDGE DISCOVERY IN DATA (KDD)
2.2.1 General Process of KDD
2.2.2 Data Mining
2.2.3 Exploratory/Descriptive Data Mining
2.2.4 Confirmatory/Predictive Data Mining
2.2.5 Classification and Prediction
2.2.6 CRISP-DM
2.2.7 Critics
2.2.8 Data Mining in a Medical Domain
2.2.9 Conclusions and Implications for this Research
2.3 INTELLIGENT DECISION SUPPORT SYSTEMS (IDSS)
2.3.1 Evolution of Intelligent Decision Support Systems
2.3.2 IDSS Components
2.3.2.1 Extraction-Transformation-Loading (ETL) Frameworks
2.3.2.2 Data Warehouse Architectures
2.3.2.3 Knowledge Discovery in Data / Data Mining Frameworks
2.3.2.4 Knowledge Base Architectures
2.3.2.5 Model Base Architectures
2.3.3 Conclusions and Implications for this Research
2.4 IDSS for Clinical Practice and Research
2.5.1 IDSS for the medical domain
2.4.2 Unrealized Potential for IDSS in the Medical Domain
2.4.3 DxPlain – a medical IDSS from the United States of America
2.4.4 Medical IDSS in the Australian Context
2.4.5 Conclusions and Implications for this Research
2.5 EXPERIMENT DESIGN AND CLINICAL TRIALS
2.6.1 Dominant Medical Research Paradigm
2.6.2 Clinical Reasoning, Statistical Reasoning and KDD
2.6.3 Conclusions and Implications for Research
CHAPTER 3 CROSS INDUSTRY STANDARD PROCESS – DATA MINING (CRISP-DM) SPECIALIZED TASKS FOR MEDICAL RESEARCH
3.1 Introduction
3.2 Extended CRISP-DM enhancing Data Management Layer of IDSS
3.3 ‘Outputs’ from Data Mining are ‘Inputs’ to the Knowledge Base of IDSS
3.4 CRISP-DM Specialised Tasks to Support Medical Research
3.4.1 Initial Phase
3.4.2 Data Preparation Phase
3.4.3 Testing Phase
3.4.4 Assessment Phase
3.4.5 Usage Phase
3.4.6 The role of exploratory and confirmatory data mining
3.5 Generation of Electronic ‘Rules’ for use in IDSS knowledge base
CHAPTER 4 INTELLIGENT DECISION SUPPORT SYSTEMS (IDSS) FOR CLINICAL PRACTICE AND RESEARCH
4.1 Introduction
4.2 Extraction-Transformation and Loading (ETL) Frameworks
4.3 Data Warehousing Architectures
4.3.1 Core Dimensions for IDSS
4.3.2 Domain Ontology
4.4 Data Mining / Knowledge Discovery in Data Frameworks
4.5 Knowledge Base Architectures
4.6 Model Base Architectures
CHAPTER 5 ‘DATABABES’ CASE STUDY
5.1 Introduction
5.2 Aim
5.3 Methods
5.4 Results
5.5 Conclusions
CHAPTER 6 CONCLUSION
6.1 Contribution to Knowledge
6.2 Future Research
6.3 Conclusion
REFERENCES
APPENDIX A
FIGURES
Figure 1.1 Susman and Evered’s Action Research Cycle
Figure 2.1 ‘We are data rich, but information poor’
Figure 2.2 ‘Searching for knowledge (interesting patterns) in your data’
Figure 2.3 Data mining as a step in the process of knowledge discovery
Figure 2.4 Fields contributing to data mining
Figure 2.5 Concept hierarchy for attribute age
Figure 2.6 CRISP-DM phases
Figure 2.7 Four level breakdown of CRISP-DM methodology
Figure 2.8 Turban and Aronson’s IDSS
Figure 2.9 Heath and McGregor’s five zones of interest
Figure 2.10 A general model of an EDSS
Figure 2.11 The Scientific Method
Figure 2.12 Continuum of Experiment Design
Figure 2.13 Convergence of Clinical and Statistical Reasoning Paradigm
Figure 2.14 Addition of KDD to Reasoning paradigm
Figure 3.1 Framework for IDSS, data management highlighted
Figure 3.2 Framework for IDSS, knowledge base highlighted
Figure 3.3 Parallelism between CRISP-DM and the Scientific Method
Figure 3.4 CRISP-DM extended for Clinical Practice and Research
Figure 4.1 Proposed IDSS Framework
Figure 5.1 ‘DataBabes’ components
Figure 5.2 Chorionic Villus Sampling star schema
TABLES
Table 2.1 Data mining operations and techniques
Table 2.2 Common type of DSS support
Table 2.3 Features of Clinical Decision Support Systems (CDSS) important for CDSS effectiveness
Table 4.1 Datatype coding
ACKNOWLEDGEMENTS
I must begin by acknowledging the invaluable guidance provided to me by
my Research Supervisor Dr Carolyn McGregor. Wrestling with the entire
research process was frustrating at times and Dr McGregor never once
faltered in her patience and quiet encouragement. I was fortunate in that Dr
McGregor offered both industry experience in the fields of data warehousing
and intelligent decision support systems and current, successful research
experience in Health Informatics. I lost count of how many times I baulked
at submitting abstracts, making presentations, applying for grants only to
have Dr McGregor listen to all my doubts, reject them all and have me
continue on over the next research ‘hurdle’.
Dr Liwan Liyanage joined as my co-supervisor towards the end of this
research and I also thank her for providing encouragement and insight.
Associate Professor John Smoleniec and Susan Heath, from the Fetal-maternal
Unit at Liverpool Hospital, have provided enthusiasm and plenty of
research support. Specifically they assisted in gaining ethics approvals,
research paper production and answering countless questions regarding the
data found in the Fetal-maternal domain. There is no way this research
would have been completed without Sue and John’s constant support and cooperation.
Last, but by no means least, I thank my husband Brad Wulff. Brad has been
very patient and provided lots of help with our children (Zoe and Megan) as
I tried the balancing act of wife, mother, full-time Academic and research
student. I appreciate his good-humoured support and acknowledge that none
of this would be possible without him.
Thank You Carolyn, Liwan, John, Sue, and Brad.
PUBLICATIONS RELATED TO DISSERTATION
Heath, J., McGregor, C., & Smoleniec, J. (2005) DataBabes: A Case Study
in Fetal-maternal Clinical Data Mining, Health Informatics Society of
Australia General Conference, Melbourne, August 2005, CD ROM, 6 pages.
Smoleniec, J., Heath, S., Heath, J., & McGregor, C. (2005) DataBabes: A
Case Study in Fetal-maternal Clinical Data Mining, Poster, Perinatal
Society of Australia and New Zealand (PSANZ) , Sydney, March 2005.
Heath, J., Heath, S., McGregor, C., & Smoleniec, J. (2004) DataBabes: A
Case Study in Data Warehousing and Mining Perinatal Data, CASEMIX
Conference, Sydney, October 2004.
Heath, J., & McGregor, C. (2004) Research Issues in Intelligent Decision
Support, UWS College of Science Technology & Environment Innovation
Conference, Sydney, June 2004.
Smoleniec, J., Heath, S., Heath, J., & McGregor, C. (2004) Fetal-maternal
Data Warehouse and Data Mining, Poster, Perinatal Society of Australia
and New Zealand (PSANZ) , Sydney, March 2004.
ABSTRACT
Existing patient medical records are a rich data source with a potential to
support clinical research. Fragmentation of data across disparate medical
databases inhibits the use of these existing datasets. Overcoming such
disjointedness is possible through the use of a data warehouse. Once the data
is cleansed, transformed and stored within the data warehouse it is possible
to turn attention to the exploration of the medical datasets. Exploratory and
confirmatory Data Mining tools are well suited to such activities.
Traditionally medical research has been conducted in accordance with the
scientific method. Informal discussions with medical practitioners exposed a
lack of confidence in data mining activities as they are perceived to not
support the scientific method. This thesis demonstrates that there are strong
parallels between the scientific method and the Cross-Industry Standard
Process – Data Mining (CRISP-DM). Extensions to CRISP-DM, as
proposed in this thesis, can be provided to strengthen these parallels.
Establishing a clinical trial to investigate conditions such as lung cancer is
relatively straightforward, given the large number of potential patients, when
compared to the rare, complex conditions found in the fetal-maternal domain.
The use of the extended CRISP-DM enables use of existing patient data and
sophisticated data mining techniques to generate potential ‘knowledge’.
The knowledge rules generated can, following clinician review, be used to
populate Knowledge Base Architectures in Intelligent Decision Support
Systems thus helping to overcome the labour intensive elicitation of domain
knowledge that hinders the establishment of IDSS in the medical domain.
This thesis is concerned with: demonstrating parallels between the scientific
method and CRISP-DM; extending CRISP-DM for use with medical
datasets; and proposing the supporting Intelligent Decision Support System
framework. This research has been undertaken using a fetal-maternal case
study.
CHAPTER 1 INTRODUCTION
1.1 INTRODUCTION
The use of data mining operations across patient medical data is gaining
interest in the research community. Little research has focussed on
improvements to data mining methodologies specifically targeting the
medical domain. This thesis proposes extensions to the Cross Industry
Standard Process – Data Mining (CRISP-DM) methodology specifically to
accommodate the demands of exploratory and confirmatory data mining
across existing patient data. The enhancements are made in the modelling
and evaluation phases of CRISP-DM specifically to support the null
hypothesis medical research paradigm.
Despite the clear demands of evidence-based medicine, little research has
focussed on meeting the requirements of null hypothesis driven data mining.
This thesis presents the parallelism that exists between the extended
CRISP-DM and the scientific method.
Efficient elicitation of medical domain knowledge for inclusion in Intelligent
Decision Support System (IDSS) knowledge bases remains an open research
area. Using the extended CRISP-DM in knowledge discovery assists in the
automated extraction of domain knowledge from existing patient data. This
illustrates the immediate application of the proposed methodology to further
the research surrounding IDSS domain knowledge capture.
The knowledge thus generated, in an electronic format, can be used to
populate the knowledge component in an IDSS when a rule generating data
mining technique is chosen in the modelling phase of CRISP-DM.
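As an illustration of the idea above, the following is a minimal sketch of how electronically generated rules might be held back from a knowledge base until a clinician has reviewed them. The class names, field names and the attribute values are hypothetical and are not drawn from the framework described in this thesis; they only show one way the review step could gate what is loaded.

```python
from dataclasses import dataclass, field


@dataclass
class MinedRule:
    """A candidate rule from a rule-generating technique (hypothetical structure)."""
    antecedent: dict      # attribute -> required value
    consequent: str       # outcome label suggested by the rule
    support: float        # fraction of records matching antecedent and consequent
    confidence: float     # fraction of antecedent matches that also show the consequent
    clinician_approved: bool = False


@dataclass
class KnowledgeBase:
    """Stores only rules that have passed clinician review (illustrative only)."""
    rules: list = field(default_factory=list)

    def load(self, candidates):
        # Only clinician-approved rules are admitted to the knowledge base.
        accepted = [r for r in candidates if r.clinician_approved]
        self.rules.extend(accepted)
        return len(accepted)

    def matching(self, patient):
        # Return the consequent of every stored rule whose conditions all hold.
        return [
            r.consequent
            for r in self.rules
            if all(patient.get(k) == v for k, v in r.antecedent.items())
        ]


# Placeholder attribute names and values, not real clinical findings.
candidates = [
    MinedRule({"attribute_a": "high"}, "outcome_x", support=0.12,
              confidence=0.81, clinician_approved=True),
    MinedRule({"attribute_b": "low"}, "outcome_y", support=0.03,
              confidence=0.55),
]
kb = KnowledgeBase()
loaded = kb.load(candidates)
advice = kb.matching({"attribute_a": "high", "attribute_b": "low"})
```

In this sketch the second rule matches the patient record but is never consulted, because it was not approved and so was never loaded.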
The results of the Australian Government sponsored Electronic Decision
Support for Australia’s Health Sector study (Australian Health Information
Council, 2002) indicate that Australian medical clinicians are reluctant to
adopt electronic decision support systems partly due to concern regarding
the content of knowledge bases.
This research focuses on the following open research areas:
1. Existing investigative methods used when data mining across patient
medical data are inadequate for the demands of clinical practice and
research. The null hypothesis driven medical research paradigm must
inform data mining investigative methods in the medical domain.
2. In the medical domain improvement is required in the elicitation of
domain knowledge for use within knowledge bases in IDSS.
3. The exploitation of IDSS in the medical domain, particularly in the
Australian context, has been slow and clinicians have concerns regarding
the content of knowledge bases found in IDSS.
Aspects of this research are explored using a fetal-maternal case study.
1.2 RESEARCH AIMS AND OBJECTIVES
The hypotheses raised in support of the research rationale described above
are:
Research Hypothesis 1: The Cross Industry Standard Process –
Data Mining (CRISP-DM) can be extended to enable its use in
medical research driven by the null hypothesis paradigm.
Research Hypothesis 2: An Intelligent Decision Support System
(IDSS) can be defined for clinical practice and research
including a data management component to exploit the extended
CRISP-DM methodology.
1.3 RESEARCH RATIONALE
1.3.1 Knowledge Discovery in Data (KDD) Approaches for Clinical Research
Within the medical domain the data mining focus has been on hospital
organisational issues (Alexandrini, Krechel, Maximini & von Wangenheim,
2003; Berndt, Fisher, Hevner & Studnicki, 2001; Ewen, Medsker,
Dusterhoft, Levan-Schultz, Smith & Gottschall, 1999; Lyman, Boyd &
Dalton, 2003; Raghupathi, Winiwarter, Werner, & Tan, 2002; Rao,
Niculescu, Germond & Rao, 2003; Rindfleisch, 1997; Schubart & Einbinder,
2000; Zaidi, Abidi & Manickam, 2002) such as efficient financial
management rather than patient data research.
Knowledge Discovery in Data researchers have documented the complex
nature of medical data and described the unique difficulties encountered when
working with such data (Goodwin, Mahler, Ochno-Machado, Iannacchione,
Crockett, Dreiseitls, Vinterbo & Hammond, 2000; Goodwin & Grzymala-Busse,
2001; Goodwin, Iannacchione, Hammond, Crockett, Mahler, &
Schlitz, 2001; Jung & Gudivada, 1995; Lee & Abbott, 2003; Masuda &
Sakamoto, 2002; Podgorelec, Kokol & Stiglic, 2002; Roddick, Fule &
Graco, 2003). As early as 1998 researchers such as Brossette (Brossette,
Sprague, Hardin, Waites, Jones & Moser, 1998) were attempting to use data
mining techniques to contribute to medical knowledge. However, the
application of data mining to historical patient medical data has mostly been
conducted in isolated research work (Goodwin et al., 2001; Goodwin et al.,
2001; Jung & Gudivada, 1995; Kovalerchuk, Vityaev & Ruiz, 2000;
Masuda et al., 2002; Matsumoto, Ueda & Kawaji, 2002; Povalej, Lenic,
Zorman, Kokol, Peterson & Lane, 2003; Roddick et al., 2003; Tsymbal &
Aronson, 2003; Wong, Lam, Leung, Ngan & Cheng, 2000) and has not
made a transition to being a widely accepted technique to inform clinical
practice and research.
The Scientific Method’s null-hypothesis driver used in medical research
calls for a modification to the approach used for data mining across medical
data, and yet Roddick et al. (2003) report that little research has been
directed towards formalisation of such a revised approach. Neither medical
research nor computing research has focussed on this open research area.
This thesis explains why such a change is needed and puts forward a KDD
methodology that extends the widely accepted CRISP-DM. The extensions
include use of exploratory and confirmatory data mining and support the
null-hypothesis paradigm, hence strengthening the value of Data
Management components of IDSS specifically to support medical domains.
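To make the distinction between exploratory and confirmatory data mining concrete, the sketch below shows one possible confirmatory step: a pattern suggested during exploration is tested against a null hypothesis on records that were not used for exploration. The two-proportion z-test, the function names, the 0.05 threshold and the synthetic records are all assumptions chosen for illustration; they are not the specific tasks defined by the extended CRISP-DM.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test of H0: both groups share the same outcome rate."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (hits_a / n_a - hits_b / n_b) / se
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def outcome_counts(records, attribute):
    """Count outcomes among records with and without the candidate attribute."""
    exposed = [r for r in records if r[attribute]]
    unexposed = [r for r in records if not r[attribute]]
    if not exposed or not unexposed:
        raise ValueError("both groups must contain at least one record")
    return (sum(r["outcome"] for r in exposed), len(exposed),
            sum(r["outcome"] for r in unexposed), len(unexposed))


def confirm(holdout, attribute, alpha=0.05):
    """Confirmatory step: test H0 on records held out from exploration."""
    z, p = two_proportion_z_test(*outcome_counts(holdout, attribute))
    return {"z": z, "p_value": p, "reject_null": p < alpha}


# Synthetic hold-out records (not patient data): 40/100 outcomes with the
# attribute present versus 20/100 with it absent.
holdout = (
    [{"flag": True, "outcome": 1}] * 40 + [{"flag": True, "outcome": 0}] * 60
    + [{"flag": False, "outcome": 1}] * 20 + [{"flag": False, "outcome": 0}] * 80
)
result = confirm(holdout, "flag")
```

Keeping the confirmatory records separate from those used for exploration is the design choice that matters here: a pattern found and tested on the same records would not be a fair test of the null hypothesis.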
1.3.2 Limitations of Existing OnLine Transaction Processing Systems
for the Fetal-maternal Domain.
This research was motivated after a review of patient data analysis in
association with Professor Smoleniec, Director Feto-Maternal Unit,
Liverpool Hospital, South Western Sydney Area Health Service (SWSAHS).
It emerged that the existing on-line transaction processing system used to
assist in clinical practice was inadequate for clinical research, particularly
multi-dimensional data analysis. A data warehouse combined with other
elements of DSS would overcome some of the problems encountered at the
fetal-maternal unit. The need to bring disparate information systems together
to leverage knowledge discovery has been widely recognised and pursued by
both the research and commercial community (Berndt et al., 2001; Bonifati,
Cattaneo, Ceri, Fuggetta & Paraboschi, 2001; Devlin, 1997; Ewen et al.,
1999; Inmon, 2002; Mallach, 2000; Marakas, 2002b). Associate Professor
Smoleniec’s aim was to leverage existing patient treatment data to assist in
the practice of evidence-based medicine and to inform clinical research.
Comments from Associate Professor Smoleniec and other clinicians raised
my awareness of the questions that surround the use of KDD methodologies
and tools in the generation of medical evidence. There was a reluctance to
‘accept’ KDD results due to the lack of rigour associated with the data
mining activities. Associate Professor Smoleniec’s concerns regarding
statistical significance and data bias directed my work towards an improved
methodology – closer to the respected Scientific Method.
1.3.3 Slow Uptake of IDSS in the Fetal-maternal Medical Domain
Little research has tackled the complex issues surrounding the establishment
of a data warehouse and IDSS architecture in the fetal-maternal domain.
Goodwin et al. (2000; 2001) directed their efforts to the demanding,
emotive, challenging field of fetal-maternal health. Establishment of such
IDSS architectures has been rapid in areas driven by the commercial desire
to show a ‘profit’, including marketing and share trading. In these domains it
was perceived that the architectures could assist in discovering competitive
advantage type knowledge from disparate information systems. More
recently counter terrorism activities have explored IDSS and associated data
warehousing and data mining functions (Popp, Armour, Senator & Numryk,
2004; Zdanowicz, 2004) and innovative Australian researchers have
employed data mining techniques to assist in missing person profiling
(Blackmore & Bossomaier, 2002), cotton growth (Johnson, 2004) and farm
management (Robinson, 2005).
Despite the successes in other industries the health sector has been relatively
slow in adopting IDSS technologies. One of the inhibitors is the difficulty
encountered when eliciting domain knowledge for use in knowledge bases
within IDSS.
1.4 RESEARCH APPROACH
This research is non-empirical in nature focussing on ideas and frameworks
as described by Alavi and Carlson (1992). Continuing with the
classifications presented by Alavi and Carlson (1992) this non-empirical
research falls into the applied sub category with emphasis on conceptual and
illustrative elements, thus covering the “why” and “how” of a framework for
an IDSS, utilizing the null hypothesis medical research paradigm, for
Clinical Practice and Research.
An action research approach has been taken for this research. Galliers’ (1993)
earlier work indicates such an approach is suitable for research aimed at
methodology development. This approach is very suitable for the multidisciplinary nature of the research undertaken with a focus on core computer
science knowledge as extended and applied in a complex fetal-maternal
medicine environment. As with most action research projects, the aim here
has been to improve practice through collaborative work between
researchers and practitioners, with interventions at the research site to ‘test’
if the prototype and associated analysis methodologies were feasible.
Awareness of the dangers inherent in this joint endeavour research approach
ensured that I did not slip into merely a ‘consultancy’ role with the
fetal-maternal collaborators. These dangers, as described in earlier work (Avison,
2002; Susman & Evered, 1978), were overcome during the conduct of this
research by:
• Careful negotiation at the beginning of the research to ensure
that there was an agreed set of aims from both the computer
science and medical participants. Presentations related to this
research have appeared at both professional medical
conferences (Heath, Heath, McGregor & Smoleniec, 2004) and
health informatics and computer science conferences (Heath &
McGregor, 2004; Heath, McGregor & Smoleniec, 2005) which
indicates the successful achievement of the common goals
across both disciplines. The importance of this negotiation is
emphasised by Kock, McQueen and Baker (1996) in their work
addressing the ‘initiative dilemma’ associated with action
research.
• Clear statement of research aim, theory and method presented
at the outset of this research work. This was necessary to gain
ethics approval from both institutions – South Western Sydney
Area Health Service (SWSAHS) and the University of Western
Sydney (UWS). These applications were both successful and
supported by SWSAHS and UWS.
• Adherence to the action research process presented by Susman
and Evered (1978), see Figure 1.1 below.
Figure 1.1: Susman and Evered’s Action Research Cycle.
(Susman & Evered, 1978)
• Inclusion of the ‘Specifying Learning’ phase ensures that rigour
is included in this research approach. In addition this aspect
ensures that a contribution to existing knowledge results from
the research. Academic and Industry publications and
presentations have emerged from this research (Heath et al.,
2004; Heath & McGregor, 2004). This is largely due to the
contribution of new knowledge that arose from this action
research, hence ensuring this was not ‘just a one-off
consultancy’, but rather a genuine contribution to knowledge.
• The work of Avison, Lau and Myers (1999) reflects the use of
Action Research for this fetal-maternal IDSS research,
particularly in matters of a ‘mutually acceptable ethical
framework’, an essential requirement for Action Research
participants to minimise conflict during the research process.
An interesting reflection on this research thesis is to consider the qualitative
Action Research framework guiding the conduct of this research. Contained
within this Action Research framework is acknowledgement of the strength
of the null hypothesis, quantitative research paradigm and exploration of the
potential for knowledge discovery in data to enhance such quantitative
research. An exploration of the merits of qualitative versus quantitative research
is beyond the scope of this thesis; however, this is an unusual thesis because
both paradigms are well regarded and embraced.
The double challenge of meeting both (1) the needs of action in an
organisation and (2) quality research made this research project more
difficult than a carefully constructed survey or experimental research
involving a set of data specifically collected for research purposes. Avison et
al. (1999) and Kock et al. (1996) considered the need to serve two
masters: the demands of immediate research clients and those of the Academic
community in general. Action research is not an efficient research method as
substantial time is spent preparing client data and this IDSS research in the
fetal-maternal domain had to wrestle with data quality and management
issues throughout the research.
Panel discussions at the 2004 8th Pacific Asia Knowledge Discovery in
Data (KDD) conference in Sydney greatly influenced my decision to adopt
an action research paradigm. Eminent persons in the KDD field including
Han (1995; 1996; 1998; 2002; Han et al., 1997; Han & Kamber, 2001; Han
& Pei, 2000; Han et al., 2000; Wang, Fan, Yu & Han, 2003) and Webb
(Webb, Han & Fayyad, 2004; Webb, 2001; Webb, Butler & Newlands,
2003; Webb, 2000) led panel discussions at the conference regarding future
directions of KDD research. The panel emphasised that future research focus
must be on tackling the use of previously described 'real world' data rather
than carefully constructed test data sets that confirm relative efficiency of
KDD algorithms. This is a challenge that I have undertaken in this thesis and
associated research work which necessitated the use of an action research
approach.
1.5 CONTRIBUTIONS TO KNOWLEDGE
This research contributes four key elements to knowledge, specifically:
(1) Extensions to the CRISP-DM to facilitate its use in clinical
practice and medical research applications.
(2) Recognition of the parallelism between CRISP-DM and the
Scientific Method and the importance of the role played by
both exploratory and confirmatory data mining.
(3) A proposed framework for IDSS in a fetal-maternal domain.
(4) Enhancements to the Data Management component within
IDSS through exploiting (1) and (2) to generate domain
knowledge for use in the Knowledge Base component of IDSS.
1.6 THESIS OVERVIEW
Chapter 1 begins with a consideration of the drivers behind this research.
These factors come from both the medical domain, specifically fetal-maternal,
and computer science. The research approach used throughout the
conduct of this cross-disciplinary research is also presented in Chapter 1. Chapter 1
concludes with a concise summary of the contribution to knowledge made
by this thesis and surrounding research activities.
Chapter 2 contains a summary of some of the literature reviewed and
considered in the conduct of this research. Initially the literature review
conducted was quite broad. To keep the focus of this thesis directed
towards the stated thesis hypotheses, many of the sources uncovered during
this research that do not directly impact on this thesis have been omitted.
Research included in the literature review is wide ranging, from the works
of the Electronic Decision Support in Australia
committee to the British Medical Journal and mainstays of Computing
research such as ACM SIGKDD, SIGMOD, IEEE and foundation DSS /
KDD researchers such as Inmon, Aronson, Han and Kimball.
Chapter 3 presents the extended CRISP-DM building on the initial data
mining concepts and open research questions described in the literature
review. The parallelism between the Scientific Method and the extended
CRISP-DM is also explored in this chapter.
The Intelligent Decision Support System (IDSS) framework I propose is
presented and discussed in Chapter 4.
The ‘DataBabes’ Case study is presented in Chapter 5. This case study
includes a brief description of the activities conducted in my research
partner’s fetal-maternal unit at Liverpool Hospital. Details of a prototype
data mart developed for clinical research on Chorionic Villus
Sampling (CVS) are also presented in Chapter 5. This data mart instantiates
the extraction, transformation and loading component and data warehouse
component of the IDSS framework proposed in Chapter 3. Early data mining
activities conducted across the fetal-maternal data are also described in this
Chapter.
This thesis concludes with Chapter 6 presenting research conclusions and
suggesting future research directions resulting from this Masters (Hons)
research.
CHAPTER 2 LITERATURE REVIEW
2.1 INTRODUCTION
This literature review considers the research areas that contribute to the
framework for IDSS clinical practice and research. The scope of the
literature review includes:
• Knowledge Discovery in Data (KDD)
• Knowledge Discovery in Data in medical domains
• Clinical practice and clinical research
• Intelligent Decision Support Systems
Particular attention is focussed on the open issues within these research
areas, the impact on this thesis and future research challenges.
2.2 KNOWLEDGE DISCOVERY IN DATA (KDD) FRAMEWORKS
2.2.1 General Process of KDD
Since the 1960s database technologies have evolved from primitive file
processing systems into sophisticated and powerful database systems,
generating an abundance of data. Unaided human comprehension is
insufficient to analyse these vast volumes of data, and society finds itself
described as ‘data rich but information poor’. Knowledge discovery in data
(KDD) is a concept born from the need to make better use of the vast
volumes of stored data. Retrospective patient medical data is the particular
focus of this research.
Han and Kamber (2001) use the following figures to illustrate the ‘data rich
but information poor’ situation and to contrast it with the search for
knowledge from within the data.
Figure 2.1: ‘We are data rich, but information poor’. (Han & Kamber, 2001)
The goal of KDD is to search for knowledge contained within the data. Han
and Kamber (2001) offer Figure 2.2 to depict the reorganisation of data
and the discovery of knowledge from within the data.
Figure 2.2: ‘Searching for knowledge (interesting patterns) in your data’.
(Han & Kamber, 2001)
Han and Kamber (2001) and Mackinnon and Glick (1999) use the abbreviation
KDD for ‘Knowledge Discovery in Databases’. Elsewhere the abbreviation is
expanded more broadly as ‘Knowledge Discovery in Data’, which is the
preferred term for this thesis.
Some practitioners and researchers, including Marakas (2002b), suggest that
data mining is a synonym for KDD. Others, including Han and Kamber (2001)
and myself, view data mining as an essential step in the process of KDD.
Han and Kamber (2001) illustrate the iterative knowledge discovery process
in the following figure:
Figure 2.3: Data mining as a step in the process of knowledge discovery.
(Han & Kamber, 2001)
Other sections of this thesis have described the databases, cleaning and
integration, data warehouse, selection and transformation and evaluation of
the KDD process. This part of the literature review focuses on data mining
aspects. Reviewing this diagram and considering the broader requirements
necessary to support data mining, such as data preparation, it is clear that
data mining is in fact a step within KDD. Data mining and KDD are not
interchangeable terms for the same activity.
The general process of KDD has been researched and published by a wide
variety of researchers and research groups, including:
1. Marakas (2002b) provides guidelines for the overall data mining process:
• select a topic for study
• identify the target data set(s)
• clean and pre-process the data
• build a data model
• mine the data
• interpret and refine
• predict
• share the model
2. Roiger and Geatz (2003) suggest a very similar KDD process model:
• goal identification
• creating a target data set
• data pre-processing
• data transformation
• data mining (a best model for representing the data is created by
applying one or more data mining algorithms)
• interpretation and evaluation
• taking action
3. The Cross-Industry Standard Process for Data Mining (CRISP-DM)
(CRISP-DM, 2004) includes the following domain-independent steps in
data mining:
   1. Business Understanding
   2. Data Understanding and Data Preparation
   3. Modelling (the actual work of data mining: specific hypotheses
      are tested or automated discovery methods are run)
   4. Evaluation
   5. Deployment
The general process described in these three examples provides an
iterative framework from which to explore further open issues in data
mining and knowledge discovery. It is important to note that data mining is
only one step in the general KDD process; however, it has received by far
the most research attention.
Each of these examples specifies a data pre-processing or data preparation
phase. Sections 2.3.2.1 and 2.3.2.2 of this thesis consider
extraction-transformation-loading frameworks and data warehouse
architectures. The discussion in those sections is directly relevant to the
data preparation stages identified in the three KDD process examples above.
In addition, researchers including Han and Kamber (2001) have observed that
real-world data are often incomplete, inconsistent and noisy. Data
preprocessing includes:
• data cleaning: routines to fill in missing values, smooth noisy
data, identify outliers and correct data inconsistencies
• data integration: routines to combine data from multiple sources
• data transformation: routines to convert data into appropriate forms
for mining
• data reduction: routines to reduce the representation of the data
while minimising the loss of information content. Example techniques
include data cube aggregation, dimension reduction and data compression.
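The cleaning and transformation routines listed above can be illustrated
with a minimal sketch. It is not drawn from Han and Kamber or from this
thesis: the mean-fill, two-standard-deviation outlier rule and min-max
scaling are simple stand-ins chosen for illustration, and the age values are
hypothetical.

```python
from statistics import mean, stdev

def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, z_threshold=2.0):
    """Data cleaning: flag values more than z_threshold standard
    deviations away from the mean (an assumed, simplistic rule)."""
    mu, sigma = mean(values), stdev(values)
    return [abs(v - mu) > z_threshold * sigma for v in values]

def min_max_scale(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical ages: one missing entry and one implausible entry (210).
ages = [28, 31, None, 35, 29, 33, 30, 32, 27, 210]
cleaned = fill_missing(ages)
outliers = flag_outliers(cleaned)
plausible = [v for v, is_outlier in zip(cleaned, outliers) if not is_outlier]
scaled = min_max_scale(plausible)
```

In practice the choice of fill value and outlier rule would need to be
agreed with domain experts, since a value that looks like noise may be
clinically meaningful.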
Researchers such as Vassiliadis, Simitsis and Skiadopoulos (2002) work
towards improving Extraction-Transformation-Loading (ETL) processing to
reduce the estimated 80% of data warehouse development time that is
commonly committed to data preparation. These researchers state:
Thus, it is apparent that the design, development and deployment
of ETL processes, which is currently performed in an ad-hoc,
in-house fashion, needs modelling, design and methodological
foundations. Unfortunately the research community has a lot of
work to do to confront this shortcoming. (Vassiliadis et al., 2002)
These researchers begin by suggesting a conceptual model, that is generic in
nature, to accommodate ETL processes. This is early research and there
remain many unanswered questions and issues for further work.
The presence of missing or incomplete data is commonplace in large
real-world datasets. Xintao et al. (2002) state:
Missing values occur for a variety of reasons eg. omissions in the
data entry process, confusing questions in the data gathering
process, sensor malfunction and so on. (Xintao, Wu, Daniel, &
Barbara, 2002)
These researchers go on to present linear algebra and constraint
programming techniques to learn the missing values using a priori known
summary information and information derived from the raw data. Learning
missing data is essential for populating data warehouses to enable data
mining, and it remains an open research area.
2.2.2 Data Mining
Han and Kamber (2001) offer the following summary of the evolution of
database technology leading to the field of data mining (DM).
• 1960s and earlier
  o primitive file processing
• 1970s – early 1980s
  o hierarchical, network and relational databases
  o data modelling tools e.g. entity-relationship diagrams
  o indexing and data organisation techniques e.g. B+tree, hashing
  o query languages e.g. SQL
  o user interfaces, forms and reports
  o query processing and optimisation
  o transaction management e.g. recovery, concurrency control
  o On-line transaction processing (OLTP)
• Mid 1980s – present
  o advanced data models e.g. extended-relational, object-oriented,
    object-relational, deductive
  o application-oriented e.g. spatial, temporal, multimedia, active
    knowledge bases
• Late 1980s – present
  o data warehouse and OLAP technology
  o data mining and knowledge discovery
• 1990s – present
  o XML based database systems
  o web mining
Han and Kamber (2001) go on to describe data mining as a step in the
knowledge discovery process and provide a definition of the term data
mining:
Data mining is the process of discovering interesting knowledge
from large amounts of data stored either in databases, data
warehouses or other information repositories. (Han & Kamber,
2001)
George Marakas (2002b) provides a similar, high-level definition of the
term data mining: ‘Data Mining (DM) is the set of activities used to find
new, hidden or unexpected patterns in data’ (Marakas, 2002b). A more
informative definition of data mining comes from Roiger and Geatz (2003):
Data mining is an induction-based learning strategy that builds
models to identify hidden patterns in data. A model created by a
data mining algorithm is a conceptual generalisation of the data.
The generalisation may be in the form of a tree, a network, an
equation or a set of rules. (Roiger & Geatz, 2003)
Data mining incorporates techniques from many disciplines including
statistics, machine learning (Zorman, Kokol, Lenic, Povalej, Stiglic &
Flisar, 2003), high-performance computing, pattern recognition, neural
networks, data visualisation, image and signal processing and spatial data
analysis.
Cabena, Hadjinian, Stadler, Verhees & Zanasi (1997) present four main data
mining operations and commonly used techniques:
     Data Mining Operation      Data Mining Techniques
1    Predictive Modelling       Classification
                                Value prediction
2    Database segmentation      Demographic clustering
                                Neural clustering
3    Link analysis              Association discovery
                                Sequential pattern discovery
                                Similar time sequence discovery
4    Deviation detection        Statistics
                                Visualization
Table 2.1: Data mining operations and techniques (Cabena et al., 1997)
Data mining can be conducted on a variety of information repositories
including, but not limited to: relational databases (Goodwin et al., 2000;
Goodwin et al., 2001; Mackinnon & Glick, 1999); data warehouses;
transactional databases; object-oriented databases; object-relational
databases; spatial databases; temporal databases; text databases;
multimedia databases; the World Wide Web; and streaming data (Babcock,
Babu, Datar, Motwani & Widom, 2002; Qiao, Agrawal & Abbadi, 2003).
A broad set of discipline areas are utilized when conducting data mining, as
illustrated by Han and Kamber (2001) in the following diagram:
[Figure 2.4 labels: Data Mining; Database Technology; Statistics;
Information Science; Visualisation; Machine Learning; Other Disciplines]
Figure 2.4: Fields contributing to Data Mining (Han & Kamber 2001)
These researchers suggest five primitives for specifying a data mining task:
1. Specification of the data set to be mined. Users specify the database
and tables, or the data warehouse and data cubes, containing the data to
be mined, together with the conditions for selecting and grouping the data
and the attributes (or dimensions) to be analysed.
2. Kind of knowledge to be mined, e.g. characterization, discrimination,
association, classification or prediction.
3. Background knowledge, often in the form of concept hierarchies.
Concept hierarchies allow discovered patterns to be expressed in concise,
high-level terms and at differing levels of abstraction. For example the
following diagram, from Han and Kamber (2001), illustrates a concept
hierarchy for the attribute age:
   Level 0:   all
   Level 1:   young      middle-aged    senior
   Level 2:   20…39      40…59          60…89
   Concept hierarchy for attribute age. (Han & Kamber, 2001)
Techniques such as binning, histogram analysis, cluster analysis and
segmentation by natural partitioning may be used for automatic generation
of concept hierarchies.
4. Interestingness measures.
5. Knowledge presentation and visualisation techniques, which are used to
display the discovered patterns.
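The age concept hierarchy given under the third primitive can be sketched as
a simple lookup that generalises a raw age to a chosen level of abstraction.
The function name and the return of None for ages outside the listed ranges
are assumptions made for this illustration; only the levels and ranges come
from the hierarchy itself.

```python
# Level 1 concepts mapped to their Level 2 age ranges (from the hierarchy).
AGE_HIERARCHY = {
    "young": (20, 39),
    "middle-aged": (40, 59),
    "senior": (60, 89),
}

def generalise_age(age, level):
    """Return the concept for `age` at hierarchy level 0, 1 or 2."""
    if level == 0:
        return "all"
    for concept, (low, high) in AGE_HIERARCHY.items():
        if low <= age <= high:
            return concept if level == 1 else f"{low}-{high}"
    return None  # age falls outside the ranges covered by the hierarchy

print(generalise_age(34, 1))  # young
print(generalise_age(34, 2))  # 20-39
```

Mining at Level 1 rather than on raw ages trades precision for patterns that
are easier to read and that are supported by more records.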
This literature review continues with a consideration of
exploratory/descriptive versus confirmatory/predictive data mining. The
distinctions between (1) the motivations and (2) the operations of these two
broad categories of data mining are fundamental to the innovative
methodology presented in this thesis. Chapter 3 considers the application of
these two data mining approaches when researching patient medical data
within a Scientific Method framework.
Data mining systems can be classified according to the kinds of databases
mined, kinds of knowledge mined, data mining techniques used or the type
of applications adapted (Han & Kamber, 2001). In addition, data mining can
be classified as either ‘exploratory/descriptive’ or
‘confirmatory/predictive’.
Descriptive mining tasks characterize the general properties of
the data in the database. Predictive mining tasks perform
inference on the current data in order to make predictions. (Han
& Kamber, 2001)
2.2.3 Exploratory/Descriptive Data Mining
Exploratory or descriptive data mining assists an analyst in exploring the
data in search of interesting patterns. The analyst uses domain knowledge,
skill and intuition to guide the exploration. This exploration is quite
subjective and, because it relies on post-hoc analyses, lacks statistical
rigour. Any resulting statistical inferences, such as p-values, are a guide
only.
Association Rule Mining involves showing data attribute value conditions
that occur frequently in a given set of data (Han & Kamber, 2001). These
rules provide a starting point for data exploration and they are a popular
tool in exploratory data mining.
Association rules are of the form X ⇒ Y, that is
"A1 ∧ … ∧ Am ⇒ B1 ∧ … ∧ Bn", where Ai (for i ∈ {1,…,m}) and Bj (for
j ∈ {1,…,n}) are attribute-value pairs.
The association rule X ⇒ Y is interpreted as "database tuples that satisfy
the conditions in X are also likely to satisfy the conditions in Y".
Example (Han & Kamber, 2001):
contains(T,"computer") ⇒ contains(T,"software")
[support=1%, confidence=50%]
meaning that if a transaction, T, contains "computer", there is a 50% chance
that it contains "software" as well, and 1% of all transactions contain both.
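The support and confidence figures attached to such a rule can be computed
directly from a set of transactions. The sketch below assumes the usual
definitions, with support as the fraction of transactions containing both
sides of the rule and confidence as that fraction divided by the support of
the left-hand side. The four transactions are invented for illustration and
do not reproduce the 1% and 50% figures of the example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule
    antecedent => consequent. Assumes the antecedent occurs at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)  # 0.25 0.5
```

Neither measure says anything about causation; a rule with high confidence
only reports that the two conditions co-occur in the mined data.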
Only a single repeating attribute or predicate, contains, is used in this
example. Association rules that contain a single predicate are called
single-dimensional association rules. The above rule can also be written as:
"computer ⇒ software [1%, 50%]"
Association rules arising from exploratory data mining cannot be used for
prediction without careful analysis and consideration by domain experts.
The research reported by Ohsaki et al. (Ohsaki, Sato, Kitaguchi, Yokoi &
Yamaguchi, 2004, 2005) considers automated approaches to determining the
value and ‘interestingness’ of association rules generated across medical
datasets. Association rules do not necessarily indicate causation, and this
warning must be heeded when data mining across medical data. This is an
important fundamental concept that underlies the extensions to CRISP-DM
described in Chapter 3 of this thesis.
An interesting association rule mining research project is that of Becquet,
Jeudy, Boulicaut and Gandrillon (2002), who applied association rule mining
to gene-expression data analysis and found the results to be complementary
to existing gene-expression clustering techniques. Association rules and
data mining with a focus on hospital infection control were researched and
reported in 1998 in the Journal of the American Medical Informatics
Association (Brossette et al., 1998).
2.2.4 Confirmatory/Predictive Data Mining
Confirmatory data mining is objective, with hypotheses and analyses planned
a priori, and the resulting p-values are more meaningful than those
generated via exploratory data mining. A hypothesis or model formulated in
exploratory data mining is tested using confirmatory data mining techniques.
Confirmatory or predictive data mining is used to indicate an expected
result based on facts contained within the mined data source. The output of
association rule mining can be included in the predictive data mining
category, but the warning regarding causation in section 2.2.3 above must be
heeded, particularly in a medical domain.
2.2.5 Classification and Prediction
Han and Kamber (2001) introduce classification:
Classification is the process of finding a set of models (or
functions) that describe and distinguish data classes or concepts,
for the purpose of being able to use the model to predict the class
of objects whose label is unknown. (Han & Kamber, 2001)
Classification and prediction data mining derives a model from a set of
objects whose class label is known, called training data. The derived model
can then be used to predict the class label of other data objects. The
derived models may take many forms including:
o decision trees: flow-chart-like tree structures in which each node
denotes a test on an attribute value, each branch represents an outcome of
the test and the tree leaves represent classes or class distributions
o classification (IF-THEN) rules: decision trees are easily converted into
classification rules
o mathematical formulae
o neural networks: collections of neuron-like processing units with
weighted connections between units
Prediction is the term used when data values, rather than class labels, are
to be predicted. Prediction can also include the identification of
distribution trends based on available historical data. Relevance analysis
is an activity that often precedes classification and prediction. Source
data often contain attributes that do not contribute to the classification
or prediction process; relevance analysis involves identifying these
attributes and omitting them from that process.
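The statement that decision trees are easily converted into classification
rules can be illustrated with a short sketch. The tree below is hand-built
rather than learned from training data, and its attributes (age_group,
prior_visit) and class labels (class_A, class_B) are hypothetical names
invented for this example.

```python
# A hypothetical, hand-built decision tree. Internal nodes test one
# attribute; leaves (plain strings) are class labels.
TREE = {
    "attribute": "age_group",
    "branches": {
        "young": "class_A",
        "middle-aged": {
            "attribute": "prior_visit",
            "branches": {"yes": "class_B", "no": "class_A"},
        },
        "senior": "class_B",
    },
}

def classify(tree, record):
    """Follow the branch matching each tested attribute until a leaf."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

def to_rules(tree, conditions=()):
    """Emit one IF-THEN rule for every root-to-leaf path in the tree."""
    if not isinstance(tree, dict):
        tests = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
        return [f"IF {tests} THEN class = {tree}"]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += to_rules(subtree, conditions + ((tree["attribute"], value),))
    return rules

label = classify(TREE, {"age_group": "middle-aged", "prior_visit": "yes"})
rules = to_rules(TREE)
for rule in rules:
    print(rule)
```

Each rule corresponds to exactly one path through the tree, so the rule set
and the tree classify every record identically.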
2.2.6 CRISP-DM
The Cross Industry Standard Process for Data Mining (CRISP-DM) is a
process model that includes the CRISP-DM methodology, reference model
and user guide. The process model was developed by a group of
industry-based data miners to be non-proprietary and freely available, and
was initially funded by the European Commission. CRISP-DM has, as its
initiators hoped, been broadly accepted as a sound foundation for data
mining activities, as referenced in (Cabena et al., 1997; Roddick et al.,
2003). CRISP-DM has six general phases as illustrated in the following
diagram:
Figure 2.6: CRISP-DM phases (CRISP-DM, 2004)
The CRISP-DM methodology is described as a hierarchical
process model, consisting of sets of tasks described at four levels
of abstraction (from general to specific): phase, generic task,
specialised task and process instance. (CRISP-DM, 2004)
At the highest level the six phases are:
(1) business understanding
(2) data understanding
(3) data preparation
(4) modelling
(5) evaluation
(6) deployment
Each of these phases has associated with it a set of sub tasks that are
spread across the lower levels of the hierarchical model as illustrated in
the diagram below, moving from generic to more specialised tasks.
[Figure 2.7 labels: CRISP Process Model; Mapping; CRISP Process]
Figure 2.7: Four Level breakdown of CRISP-DM methodology
(CRISP-DM, 2004)
Generic tasks are those that must be completed in all data mining
situations, such as cleaning and reformatting data, whether the scenario is
retail, financial, anti-terrorism or medical. The third level, specialised
tasks, describes in more detail the particular requirements, such as
cleaning numeric, text or streaming data. The lowest level, the process
instance, holds a record of the execution and results of a particular data
mining instance.
2.2.7 Critics
Data Mining activities are viewed with scepticism by some researchers. This
issue has been raised in many publications including the work of Mackinnon
and Glick (1999) who warn of situations where statisticians have used the
term data mining to denote unsavoury ‘data dredging’ or ‘fishing
expeditions’ in search of publishable P-values. The identification and
acknowledgement of such concerns moved my research in a direction that
sought an acceptable framework within which to conduct data mining
activities in a medical domain. The result is the improved CRISP-DM
methodology that acknowledges the rigour of the Scientific Method and
places data mining activities into this rigour as detailed in Chapter 3.
Han and Kamber (2001) also warn that, although there are many 'data
mining systems' available in the commercial market, many cannot perform the
advanced activities needed for genuine data mining. These researchers also
warn against the inappropriate use of all-purpose data mining systems:
Different applications often require the integration of
application-specific methods. Therefore, a generic, all-purpose
data mining system may not fit domain-specific mining tasks.
(Han & Kamber, 2001)
Marakas (2002b) summarizes the limitations of data mining as:
o Identification of missing information. Future systems need to include
mechanisms for 'inventorying' the dataset to determine sufficiency of
attributes for DM.
o Data noise and missing values. Noise is the difference between a model
and its predictions. Data is referred to as ‘noisy’ when it contains
missing or incorrect values or extraneous columns (Connolly & Begg, 2005).
DM systems use statistical techniques to deal with noise. These techniques
rely on known distributions of data noise. Future systems must incorporate
more sophisticated mechanisms for treating missing or noisy data.
These two limitations represent ongoing open research areas.
2.2.8 Data Mining in a Medical Domain
Data mining for purposes such as market basket analysis, missing persons
analysis, sales forecasting and customer management and retention takes
place in a very different setting from that of fetal-maternal and broader
medical environments. Later sections of this thesis consider, in detail, the
implications of conducting data mining in the null-hypothesis-driven medical
domain. Other issues also require special consideration when conducting
data mining activities in the medical domain, including but not limited to:
ethics, data availability, data quality and the demands of evidence-based
medicine.
Recent KDD research in the medical domain has exposed the importance of
close collaboration between knowledge engineers and clinicians due to large,
complex, heterogeneous, hierarchical, time-varying and quality-varying
datasets (Goodwin et al., 2001; Roddick et al., 2003).
The following issues, described by Roddick et al. (2003), emerge from
current research as key matters for consideration when data mining in the
medical domain:
o Investigative Method
o Rule Interpretation
o Working with a Considerable Knowledge Base
o Data Availability
o Accuracy and Ethical Safeguards
Finding new patterns in medical data is a driving force behind KDD in the
medical domain (Goodwin et al., 2001; Podgorelec et al., 2002; Roddick et
al., 2003). Innovative algorithms are constantly sought in KDD; however,
there is an emerging view amongst some DM/KDD researchers that future
research must focus on tackling the previously described 'real world' data
rather than carefully constructed test data sets that confirm the relative
efficiency of algorithms (Webb, Han & Fayyad, 2004).
2.2.9 Conclusions and Implications for this Research
When applying the CRISP-DM methodology in the fetal-maternal domain,
and the broader medical domain, an over-arching research process or
investigative method must drive the activities in the lower three layers of
the CRISP-DM hierarchy. This is an important, unaddressed open research
question. Any data mining carried out on medical datasets will be of
restricted value in the medical field if it has not been produced using a
rigorous, scientific-method-driven approach. This conclusion, and its
implications for future data mining in the medical domain, is one of the
strongest influences on the concepts developed for this research thesis.
The research of Roddick et al. (2003) has particularly influenced my
research.
2.3 INTELLIGENT DECISION SUPPORT SYSTEMS (IDSS)
This section of the literature review begins with a brief introduction to the
components of a generic Decision Support System (DSS). Open research
issues are highlighted, with particular focus on issues that may impact the
medical domain, and the impact on this research is considered.
2.3.1 Evolution of Intelligent Decision Support Systems
Decision making has been the focus of substantial research as documented in
texts and papers including those of Anthony (1965) and Little (1970).
Marakas (2002a) states that the Decision Support System (DSS) concept
arose from the problems faced daily by organisational decision makers. In
the early 1970s two influential papers, by Little (1970) and by Gorry and
Scott Morton (1971), provided the genesis of modern DSS. Little recognised
that managers needed a
Model-based set of procedures for processing data and
judgements to assist a manager in his decision making. (Little,
1970)
Gorry and Scott Morton (1971) introduced the term decision support system
and developed a two-dimensional framework for computer-based support of
management decision making. The framework has two continuous dimensions:
classification of decision structure on the vertical dimension and level of
managerial activity on the horizontal dimension. It extended the earlier
classification of decision structure proposed by Simon (1960) and the
managerial activity levels of Anthony (1965). Decisions made within the
fetal-maternal domain, and the broader medical domain, largely fall into the
semi-structured category which, according to Gorry and Scott Morton (1971),
benefits from DSS support.
Marakas (2002a) defines DSS in the following way:
A decision support system is a system under the control of one or
more decision makers that assists in the activity of decision
making by providing an organized set of tools intended to impose
structure on portions of the decision-making situation and to
improve the ultimate effectiveness of the decision outcome
(Marakas, 2002a).
Other researchers, such as Mallach (2000) and Turban and Aronson (2001),
have provided various definitions of DSS which generally follow similar
themes to those incorporated in Marakas’s definition above.
Marakas (2002a) presents a table to summarise the common types of support
provided by DSS. Numbers have been added to Marakas’ list to aid future
reference in this literature review.
      Common type of support provided by DSS
1     Explores multiple perspectives of a decision context
2     Generates multiple and higher quality alternatives for
      consideration
3     Explores and tests multiple problem-solving strategies
4     Facilitates brain storming and other creative problem solving
      techniques
5     Explores multiple analysis scenarios for a given decision
      context
6     Provides guidance and reduction of debilitating biases and
      inappropriate heuristics
7     Increases decision maker's ability to tackle complex problems
8     Improves response time of decision maker
9     Discourages premature decision making and alternative
      selection
10    Provides control over multiple and disparate sources of data
Table 2.2: Common types of DSS support.
Marakas (2002a) presents some characteristics common to most DSS
applications:
o Employed in semi-structured or unstructured decision contexts
o Intended to support decision makers rather than replace them
o Supports all phases of the decision making process
o Focuses on the effectiveness of the decision-making process
rather than its efficiency
o Is under control of the DSS user
o Uses underlying data and models
o Facilitates learning on the part of the decision maker
o Is interactive and user friendly
o Is generally developed using an evolutionary, iterative process
o Provides support for all levels of management from top
executives to line managers
o Can provide support for multiple independent or interdependent
decisions
o Provides support for individual, group and team-based decision
making contexts
As the concept of a DSS evolved from its inception in the 1970s to the
present day, numerous DSS variations have emerged including
knowledge-based systems, artificial intelligence, expert systems, data
visualisation systems, executive information systems and group support
systems.
Marakas (2002a) generally classifies components of a DSS into five distinct
parts:
1. The data management system
2. The model management system
3. The knowledge engine
4. The user interface
5. The user(s)
The data management component stores, retrieves and organises the DSS
data. Several subsystems make up this data management component
including the physical database(s), database management system and query
facility. In addition DSS security functions, data integrity and data
administration procedures are provided by the data management component.
The quantitative models and analytical capabilities of the DSS are provided
by the model management system. The model base, model base management
system, and model execution and synthesis processors make up the model
management system. The model base in a DSS is the modelling counterpart
to the database. Decision models offer a simplified representation of
reality and may be broadly classified as abstract or conceptual decision
models. Abstract models include deterministic, stochastic, simulation and
domain-specific models. In a deterministic model no variable can take more
than one value at any given time; thus the same output values will always
result from a given set of input values. Stochastic models have at least one
uncertain variable, which must be described by some probability function.
Simulation models allow the testing of various outcomes; by comparing the
results the decision maker (DSS user) can determine the most desirable
alternative.
The knowledge engine provides the “brains” of the DSS. (Marakas, 2002a)
The knowledge base houses the domain-specific knowledge including the
rules, heuristics, boundaries, constraints, previous outcomes and other
information necessary for the problem domain. Elicitation of domain
knowledge for use in knowledge bases is an open research area and this
thesis presents a methodology for using existing historical patient records to
generate fetal-maternal domain ‘rules’.
Continuing with Marakas’ “brains” and “brawn” metaphor for an expert
system, the inference engine provides the “brawn”. The inference engine
(IE) puts the knowledge to work to produce solutions. The IE operates on a
three-phase control cycle: (1) match rules with given facts, (2) select the
rule to be executed and (3) execute the rule by adding the deduced fact to
the working memory. Operation of the IE is driven by the deductive
inference rule known as modus ponens, which states “if A is true and A
implies B is true, then B is true.” The counterpart rule, modus tollens,
states “if A implies B is true, and that B is false, then we can conclude
that A is also false”.
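The three-phase control cycle can be sketched as a small forward-chaining
loop. This is a minimal illustration rather than Marakas's design: rules are
assumed to be (premises, conclusion) pairs, facts are plain strings, and the
select phase simply takes the first matching rule, whereas a real inference
engine would apply a conflict-resolution strategy.

```python
def forward_chain(rules, facts):
    """Apply modus ponens repeatedly until no rule yields a new fact.

    rules: list of (premises, conclusion) pairs
    facts: iterable of facts known at the start
    """
    working_memory = set(facts)
    while True:
        # (1) Match: rules whose premises all hold and whose conclusion
        #     is not yet in working memory.
        matched = [
            (premises, conclusion)
            for premises, conclusion in rules
            if set(premises) <= working_memory
            and conclusion not in working_memory
        ]
        if not matched:
            return working_memory
        # (2) Select: take the first matching rule (assumed strategy).
        _, conclusion = matched[0]
        # (3) Execute: add the deduced fact to working memory.
        working_memory.add(conclusion)

# Hypothetical rule base: A implies B; B and C imply D; E implies F.
rules = [(("A",), "B"), (("B", "C"), "D"), (("E",), "F")]
result = forward_chain(rules, {"A", "C"})
print(sorted(result))  # ['A', 'B', 'C', 'D']
```

The fact F is never deduced because its premise E is not in working memory,
which shows that the engine only derives what the given facts support.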
The user interface provides the means by which the DSS user works with the
data, model and processing components of the system. The users are an
essential component of the DSS:
Without considering the user as part of the system we are left with a set of
computer based components that, by themselves, provide no useful function
at all. (Marakas, 2002a)
Recent Decision Support System research focuses on the use of DSS in a
wide variety of problem domains, including Australian farming systems
(Robinson, 2005), the cotton industry (Johnson, 2004) and Zeleznikow and
Nolan’s (2001) generic work in uncertain domains.
2.3.2 IDSS Components
A typical IDSS architecture, as presented by Turban and Aronson (2001), is
illustrated in Figure 2.8 below. The inclusion of the optional knowledge
management and model management components adds the ‘intelligence’ factor
to the definition of DSS provided in earlier sections of the literature
review. Some aspects of IDSS that fall beyond the scope of this research,
for example the user interface and report generator, have been excluded.
[Figure 2.8 labels: Other Computer-based System; Data: external and
internal; Model Management; Data Management; Knowledge Management; User
Interface; DSS; Manager (user)]
Figure 2.8: Turban and Aronson’s IDSS. (2001)
Heath and McGregor (2004) present an expanded IDSS framework divided
into five zones of interest:
1. Extraction-Transformation and Loading Frameworks
2. Data Warehouse Architectures
3. Data Mining/Knowledge Discovery in Data Frameworks
4. Knowledge Base Architecture
5. Model Base Architecture
[Figure 2.9 labels. Zones: ETL Frameworks; Data Warehouse Architecture;
Data Mining/Knowledge Discovery in Data Frameworks; Knowledge Base
Architecture; Model Base Architecture. Components: Data Source; Extract
Transform Load (ETL) Process; Data Warehouse (DW) / Mart; IDSS Data
Management; Domain Ontology; Knowledge Base; Model Base]
Figure 2.9: Heath and McGregor’s five zones of interest.
(Heath & McGregor, 2004)
The following sections of this literature review consider each of these zones
and document open research areas within each zone and across zones.
2.3.2.1 Extraction-Transformation-Loading (ETL) Frameworks
Research within academia and industry has not focused on issues within
ETL (Xintao, Wu, Daniel, & Barbara, 2002), even though ETL accounts for
approximately 80% of the effort required to prepare a Data Warehouse for
Decision Support and Data Mining activities (Inmon, 2002; Xintao, Wu,
Daniel, & Barbara, 2002). This lack of attention from the research
community is partly explained by the uncertain and dubious nature
of 'real world' data that dominates this zone of interest. Exploration of data
mining algorithms, machine learning, rule induction, neural networks,
Bayesian techniques, and Model Base and Knowledge Base construction and
exploitation, as found in IDSS, requires 'clean', clearly defined data sets.
Generic activities often included in ETL are:
o Translating coded values: source systems may use I and O for
inpatients and outpatients, while the data warehouse may use 1
for inpatients and 2 for outpatients.
o Encoding free-form values: patient disposition may be recorded
as ‘composed’, ‘calm’ or ‘relaxed’ and mapped to a numeric
value of 1.
o Deriving new calculated values
o Light or heavy summarization of data
o Addition of surrogate keys for use in DW rather than using
natural keys.
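The generic ETL activities listed above can be illustrated with a short transformation sketch. The code translation (I/O to 1/2) and the disposition mapping come from the examples in the list; the column names `patient_type`, `disposition`, `admit_day` and `discharge_day`, the derived `stay_days` value and the `transform` function are hypothetical and introduced only for illustration.

```python
from itertools import count
from typing import Dict, Iterable, Iterator, Optional

# Coded-value translation from the example above: I/O in the source, 1/2 in the DW.
PATIENT_TYPE = {"I": 1, "O": 2}
# Free-form disposition terms from the example above, all mapped to the value 1.
DISPOSITION = {"composed": 1, "calm": 1, "relaxed": 1}


def transform(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, Optional[int]]]:
    """Apply coded-value translation, free-form encoding, a derived value
    and a surrogate key to each source row."""
    surrogate = count(1)  # surrogate keys replace natural keys in the DW
    for row in rows:
        yield {
            "patient_key": next(surrogate),
            # Unmapped codes become None so they can be reviewed, not guessed.
            "patient_type": PATIENT_TYPE.get(row["patient_type"].strip().upper()),
            "disposition": DISPOSITION.get(row["disposition"].strip().lower()),
            # Derived calculated value (hypothetical columns): length of stay.
            "stay_days": int(row["discharge_day"]) - int(row["admit_day"]),
        }


source_rows = [
    {"patient_type": "i", "disposition": " Calm ", "admit_day": "3", "discharge_day": "7"},
    {"patient_type": "X", "disposition": "agitated", "admit_day": "1", "discharge_day": "1"},
]
warehouse_rows = list(transform(source_rows))
print(warehouse_rows)
```

Returning `None` for values that have no mapping is one possible design choice; it keeps 'dirty' source values visible for later cleaning rather than silently assigning them a code.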
Some knowledge domains generate more ideal data for ETL, for example
sensor data in medical or engineering domains; however, many transactional
datasets created by data sources (DS) are not of suitable quality for inclusion
in DW and subsequent IDSS (Berndt et al., 2001). Improved definition of
ETL activities and provision of formal foundations for their conceptual
representation is an open research issue (Vassiliadis et al., 2002).
2.3.2.2 Data Warehouse Architectures
Data Warehouses have been exploited to manage persistent data; a current
research challenge is the functional support of transient data in continuous
data streams (McGregor, Bryan, Curry & Tracey, 2002; Qiao et al., 2003;
Yu, 2004). This is of particular interest in the medical domain given the
pervasive use of physiological sensor devices (McGregor et al., 2002). An
increasing volume of data needed for decision-making is stored in XML data
format. Golfarelli, Rizzi and Vrdoljak (2001) propose a semi-automatic approach for building the conceptual schema for a data mart
starting from the XML sources. Current research (Hummer, Bauer & Harde,
2003) introduces XCube, a family of XML based document templates, to
exchange data warehouse data to enable integration for creation of vendor
independent virtual or federated data warehouses. Future research lies with
the combination of XCube with the new Web Services paradigm (Hummer
et al., 2003). Some research (Miquel & Tchounikine, 2002) explores pushing
the boundaries of data commonly associated with data warehouses. These
researchers propose an integrated system that can store traditional data
provided by OLTP systems and raw data acquired with specialized
electronic measurement devices, such as a standard electrocardiogram. In
addition, they suggest a process database that stores software components to
realize 'on-the-fly' data transformation. Clinical Data Warehouses pose fresh
challenges to researchers including: need for complex data modeling
features, advanced temporal support, advanced classification structures,
continuously valued data, dimensionally reduced data and the integration of
very complex data (Pedersen & Jensen, 1998).
Domain ontologies are explicit specifications of how knowledge in a domain
is conceptualized (Gruber, 1992). Elicitation of domain knowledge is a
critical aspect of IDSS development. Domain ontologies are frequently
associated with research directed towards the semantic Web. Domain
ontologies have been included in the data warehouse architectures section of
this literature review because in Chapter 4 of this thesis, a domain ontology
forms part of the recommended IDSS architecture for use in the fetal-maternal domain. In addition, the domain knowledge, elicited when creating
the ontology, is crucial within the IDSS Knowledge Base and for use with
query and reporting tools in the IDSS Data Management layer.
Gahleitner, Behrendt, Palkoska and Weippi (2005) state:
Ontology building is still much more of a craft rather than an
engineering discipline. Each development team usually follows
its own set of principles, design criteria and phases. (Gahleitner
et al., 2005)
Little research has followed on from the early recognition of a need for a
systematic approach to domain knowledge elicitation. Some researchers
have dismissed attempts to directly involve domain experts as both time
consuming and unreliable (Jung & Budivada, 1995). However, in 2005 a
number of publications reported a research focus on the establishment
of automated and semi-automated methods to build domain ontologies from
existing systems and documentation (Amardeilh, Laublet & Minel, 2005;
Gahleitner et al., 2005; Hayes, Reichherzer & Mehrotra, 2005; Sabou, Wroe,
Goble & Mishne, 2005; Upadhyaya & Kumar, 2005). These researchers
agree that currently the creation of high-quality ontologies is a very time
consuming and expensive activity and suggest that overcoming this issue is
the motivation behind their research.
The work of Amardeilh et al. (2005) has focused on the semi-automatic
creation of ontologies from textual documents. The semantic annotation and
ontology population are dependent on the knowledge captured in the
documents by the domain experts. That research is of interest here because
much fetal-maternal domain knowledge is
captured in textual documents. As the approach proposed by Amardeilh et
al. (2005) matures, it should be possible to employ it in the proposed fetal-maternal IDSS framework to assist in elicitation of domain knowledge and
construction of a domain ontology.
Extracting ontologies from extended entity-relationship diagrams is under
research, as described by Upadhyaya and Kumar (2005). These methods and
the resulting domain ontologies have not yet gained the confidence of some user
communities; of particular interest for this research is the medical
community. Research is continuing on the automated creation of ontologies.
Of particular note is the current research into a formal approach and
automated tool for translating entity-relationship schemata into Web
ontology and semantic markup language (OWL) (Xu, Cao, Dong & Wenping,
2004). Related ontology research explores ontology-learning and instance-data migration from databases. Two ongoing projects (Beckett, 2004;
Gomez-Perez, 2004) are developing tools for creating ontologies and
ontological instances from databases.
Gene ontologies have been successfully constructed and exploited by the life
sciences communities to assist in providing transparent access to integrated
scientific resources. Of note are the Transparent Access to Multiple
Bioinformatics Information Sources (TAMBIS) project (Boucelma, Castano
& Goble, 2002) and UTS/Westmead Children’s Hospital research (Kennedy
et al., 2004).
Similar capture of domain knowledge, in the form of an ontology, will enable
greater exploitation of existing DS in other domains. Innovative future
research needs to overcome the above-mentioned obstacles to ontology
development to allow broader development and adoption of ontologies to
facilitate all aspects of IDSS.
2.3.2.3 Knowledge Discovery in Data / Data Mining Frameworks
These are an important part of the framework and have been elaborated on in
section 2.3 of this literature review.
2.3.2.4 Knowledge Base Architectures
Elicitation of domain knowledge for inclusion in IDSS Knowledge Bases is
an open area of research that has not captured much attention from the
research community (Boose, 1985), for reasons similar to those that hinder
ontology creation. Marakas (2002a) introduces the concept of people called
knowledge engineers who interview domain experts and gather the
information necessary for an IDSS knowledge base. In some domains such
an approach is suitable, for example collecting knowledge relating to the
reasoning required to determine if a potential banking customer is suitable to
take on a loan. The knowledge engineers utilize knowledge acquisition
techniques such as interviewing, protocol analysis and modelling. The very
complex nature of knowledge within the medical domain precludes this
labour intensive approach to knowledge acquisition.
Semi-automatic methods are current areas of research interest. Rule
induction from medical datasets is an ongoing, open research area
(Podgorelec et al., 2002). These researchers conclude that the suggested
semi-automatic approach to knowledge discovery, with physician
assessment, resulted in a good outcome. The outcome was a set of
automatically induced rules which the physicians assessed as mostly correct
and reliable. Physician acceptance of the Knowledge Base in any medical
IDSS is a challenge that will need to be met by future researchers to ensure
acceptance of IDSS (Cesnik, 2002). This research describes a gap between
the knowledge repositories and the instantiation of such knowledge in IDSS.
2.3.2.5 Model Base Architectures
The model base provides the financial, forecasting, management science and
other quantitative functions that provide analysis capabilities to the DSS
(Turban & Aronson, 2001). Typically the model component of an IDSS is
composed of: the model base; the model base management system; the modelling
language; the model directory; and the model execution, integration and command
processor.
Models for common business functions, such as allocating and controlling
organisational resources, are freely available. Some models specific to the
fetal-maternal domain have been developed elsewhere, such as Summons et
al.'s (1999) work on fetal gestation estimation. However, the development of
suitable models for the analysis required in the fetal-maternal domain
remains an open research area.
2.3.3 Conclusions and Implications for this Research
The IDSS arena provides many opportunities for research – most of which
are beyond the scope of this thesis. The elements of open research that will
be carried forward in this thesis are those relating to the elicitation of domain
knowledge for establishment of (1) domain ontologies and (2) knowledge
bases for use in fetal-maternal IDSS.
The benefits of establishing a data warehouse are substantial and have
influenced the framework for IDSS in the fetal-maternal domain, which delivers a
consolidated, transformed, clean fetal-maternal dataset. The practical aspects
of this are explored in the fetal-maternal case study in Chapter 5.
A brief comment on the importance of the users of DSS has been included in
this literature review. The brevity of this inclusion does not indicate a low
priority being placed on users within this research. In fact, it is the
acceptance and confidence of clinician users that influences the manner in
which the knowledge base is established. To ensure user acceptance of any
IDSS it is essential that the fetal-maternal knowledge base, and broader
medical knowledge bases, be produced via a sound scientifically based
methodology as emphasised in the Electronic Decision Support for
Australia’s Health Sector Report (Australian Health Information Council,
2003).
2.4 IDSS for Clinical Practice and Research
The Generic Application Foundations and The Base Management Subsystem
described in the research presented by Yu and Chien (2004) are aimed at
supporting consumers during all phases of on-line purchase. The
development of similar foundations and subsystems for medical IDSS
presents an open research issue that has not attracted substantial attention
from the research community. This may be due to the lack of financial
incentives, the complex nature of medical Base Management Subsystems
and the demands of evidence-based medicine that must inform the knowledge,
model and past case base components of the IDSS subsystems (Cesnik,
2002).
2.4.1 IDSS for the medical domain.
Warren and Stanek (2005) consider the use of DSS in medical domains
indicating that health, like any industry, makes use of DSS for efficient
organisational management. These researchers emphasise that:
Where DSS for health become special, however, is in support
for clinical-decision making… A Clinical DSS provides
patient-specific healthcare advice. (Warren & Stanek, 2005)
With a particular interest in the use of DSS within a medical domain, Warren
and Stanek (2005) classify DSS into the following major categories:
o Data presentation/visualisation tools
   o Support the decision maker by rearranging existing data
     for easier assimilation.
o Problem solving by search
   o Searches an existing database using particular
     parameters in a search query, e.g. ‘Given drug A and
     drug B, search an interaction table and return all
     entries involving A and B’ (Warren & Stanek,
     2005).
o Case based reasoning
   o Simulates reasoning by experience; attempts to
     match current parameters with those held in the
     system's knowledge, which is embodied in a library
     of past cases.
o Symbolic reasoning systems
   o Typically consist of a knowledge base, an inference
     engine and the dynamic data currently being
     processed. Knowledge stored in the knowledge base
     is formalised, frequently as a rule, for example:
     If [fasting plasma glucose > 7.0 mmol/l] then
     [infer the Dx: Diabetes mellitus] (Warren &
     Stanek, 2005)
     where Dx is the diagnosis.
o Artificial neural networks
   o Designed to be analogous to the neurological
     function of the human brain. Use a training data set
     with known outcomes to build a structure capable
     of making classifications or predictions describing
     the problem to be decided.
o Simulation modelling tools
   o Use models to create a simplified representation of
     the real world to predict outcomes or explain
     observed behaviours.
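The 'problem solving by search' category can be illustrated with the drug interaction query quoted above. The sketch below is a minimal illustration only: the `Interaction` type, the `find_interactions` function and every table entry are hypothetical placeholders, not real pharmacological data.

```python
from typing import FrozenSet, List, NamedTuple


class Interaction(NamedTuple):
    drugs: FrozenSet[str]  # drugs involved in this table entry
    note: str              # description of the interaction


# Placeholder entries only; these are not real drug interactions.
INTERACTION_TABLE = [
    Interaction(frozenset({"drug_a", "drug_b"}), "placeholder note 1"),
    Interaction(frozenset({"drug_a", "drug_c"}), "placeholder note 2"),
    Interaction(frozenset({"drug_a", "drug_b", "drug_d"}), "placeholder note 3"),
]


def find_interactions(table: List[Interaction], *drugs: str) -> List[Interaction]:
    """Return every table entry that involves all of the named drugs."""
    wanted = frozenset(d.lower() for d in drugs)
    return [entry for entry in table if wanted <= entry.drugs]


# 'Given drug A and drug B, return all entries involving A and B.'
hits = find_interactions(INTERACTION_TABLE, "Drug_A", "drug_b")
for hit in hits:
    print(sorted(hit.drugs), hit.note)
```

The search applies no reasoning of its own: it only filters stored entries by the query parameters, which is what distinguishes this category from the symbolic reasoning systems described in the same list.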
2.4.2 Unrealized Potential for IDSS in the Medical Domain.
In the April 2005 issue of the British Medical Journal, Kawamoto, Houlihan,
Balas and Lobach (2005) listed 15 features of DSS used in a clinical setting
and gave an example of each. The purpose of that research was to identify
which of these Clinical Decision Support System (CDSS) features assisted
in the provision of improved patient care. The 15 features and examples are
repeated here to enable a comparison to be drawn between what Marakas
(2002a) presents as types of support that can be provided by DSS, Table 2.1,
and the types of DSS support currently utilized in clinical practice
(Kawamoto et al., 2005).
There is a ‘gap’ between the support a DSS can potentially provide and the
current manner in which DSS are utilized in the medical domain. This
research thesis identifies this unrealized potential and presents a
methodology that assists in closing the ‘gap’, drawing clinicians beyond
relying on IDSS and DSS only for monitoring tasks and towards ‘what-if’ type
analysis, through the establishment of solid knowledge bases built using
sound scientific methods.
DSS Feature, followed by Example

General system features
1. Integration with charting or order entry system to support workflow integration
   Example: Preventive care reminders attached to patient charts
2. Use of computer to generate the decision support
   Example: Patients overdue for ovarian cancer screening identified by querying a clinical database rather than by manual chart audits

Clinician-system interaction features
3. Automatic provision of decision support as part of clinician workflow
   Example: Diabetes care recommendations printed on paper form and attached to relevant patient charts by clinic support staff, so that clinicians do not need to seek out the advice of the CDSS
4. No need for additional clinician data entry
   Example: Electronic or manual chart audits are conducted to obtain all information necessary for determining whether a child needs immunisations
5. Request documentation of the reason for not following CDSS recommendations
   Example: If a clinician does not provide influenza vaccine recommended by CDSS, the clinician is asked to justify the decision with a reason such as “The patient refused” or “I disagree with the recommendation”
6. Provision of decision support at time and location of decision making
   Example: Preventive care recommendations provided as chart reminders during an encounter, rather than as monthly reports listing all patients in need of services
7. Recommendations executed by noting agreement
   Example: Computerised physician order entry system recommends peak and trough drug concentrations in response to an order for aminoglycoside, and the clinician simply clicks ‘Okay’ to order the recommended tests

Communication content features
8. Provision of a recommendation, not just an assessment
   Example: System recommends that the clinician prescribes antidepressants for a patient rather than simply identifying patient as being depressed
9. Promotion of action rather than inaction
   Example: System recommends an alternative view for an abdominal radiograph that is unlikely to be of diagnostic value, rather than recommending that the order for the radiograph be cancelled
10. Justification of decision support via provision of reasoning
   Example: Recommendation for diabetic foot exam justified by noting date of last exam and recommended frequency of testing
11. Justification of decision support via provision of research evidence
   Example: Recommendation for diabetic foot exam justified by providing data from randomised controlled trials that show benefits of conducting the exam

Auxiliary features
12. Local user involvement in development process
   Example: System design finalised after testing prototypes with representatives from targeted clinician user group
13. Provision of decision support results to patients as well as providers
   Example: As well as providing chart reminders for clinicians, CDSS generates postcards that are sent to patients to inform them of overdue preventive care services
14. CDSS accompanied by periodic performance feedback
   Example: Clinicians are sent emails every 2 weeks that summarise their compliance with CDSS recommendations for the care of patients with diabetes
15. CDSS accompanied by conventional education
   Example: Deployment of CDSS aimed at reducing unnecessary ordering of abdominal radiographs is accompanied by a ‘grand rounds’ presentation on appropriate indications for ordering such radiographs
Table 2.3: Features of Clinical Decision Support Systems(CDSS) important
for CDSS effectiveness. (Kawamoto et al., 2005)
Table 2.3, above, resulted from Kawamoto et al.'s (2005) literature searches
via Medline, CINAHL and the Cochrane Controlled Trials Register. Seventy
studies were included in the review. The auxiliary features 12-15 inclusive
are not unique to decision support systems. Features 12, 14 and 15 are ideal
concepts for use in all system development and implementations. Feature 13
is also not specific to a DSS; it is simply the provision of additional
outputs to patient recipients. Such a feature has great value in the medical
domain but does not require specific DSS support. A number of the features
include the use of diary or date-minding functionality, which can be
provided by many generic scheduling applications; again, this does not call
on specialist DSS functionality. Coupled with the date reminders are the
recommended care/treatment protocols for particular conditions. A number
of the features listed note the monitoring of clinicians' adherence to the
treatment protocols and highlight any variance. These features also do not
call on the specialist capabilities of DSS, as described by Marakas (2002a) in
Table 2.1. The information system stores
recommended treatment protocols, such as those for diabetes care and ovarian
cancer screening, and references these to provide reminders, as in features 1, 2,
3 and 5. Databases holding patient history are routinely queried to identify
patients who meet particular criteria, such as those described in features 2,
10 and 14. Structured Query Language (SQL) queries such as these are
readily run against relational databases without the need for a DSS.
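To illustrate that a query of the kind described in feature 2 needs no DSS, the following sketch runs a plain SQL query against an in-memory SQLite database using Python's standard `sqlite3` module. The `patient` and `screening` tables, their columns, the sample rows and the two-year screening interval are all hypothetical and chosen only for illustration; they are not taken from any clinical system or guideline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE screening (patient_id INTEGER, screened_on TEXT);
INSERT INTO patient VALUES (1, 'Patient A'), (2, 'Patient B'), (3, 'Patient C');
INSERT INTO screening VALUES
    (1, '2003-02-01'), (1, '2005-06-15'), (2, '2002-11-30');
""")

# Patients never screened, or last screened more than two years (assumed
# interval) before the reference date, are reported as overdue.
OVERDUE_SQL = """
SELECT p.patient_id, p.name, MAX(s.screened_on) AS last_screened
FROM patient AS p
LEFT JOIN screening AS s ON s.patient_id = p.patient_id
GROUP BY p.patient_id, p.name
HAVING MAX(s.screened_on) IS NULL
    OR MAX(s.screened_on) < DATE(:today, '-2 years')
ORDER BY p.patient_id
"""

overdue = conn.execute(OVERDUE_SQL, {"today": "2006-11-16"}).fetchall()
for row in overdue:
    print(row)
```

The query only retrieves stored facts that satisfy a fixed criterion; it generates no alternatives and provides no reasoning, which is the distinction drawn in the paragraph above.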
Features 8-11 inclusive move into the DSS domain as described by
many researchers (Cesnik, 2002; Mallach, 2000; Marakas, 2002a; Turban &
Aronson, 2001; Warren & Stanek, 2005). Features 8 and 9 both describe
the DSS making at least one recommendation for action, and these features
could be condensed into one. Features 10 and 11 are crucial features,
unique to a clinical DSS, and cannot be omitted. Given the nature of
evidence-based medicine, these features are essential to enable the clinician
to ‘know’ what knowledge the CDSS is using in creating a recommendation.
The need for such easy access to evidence was highlighted by Cesnik (2002)
in mapping a future direction for DSS in Australia, to assist in building
clinicians' confidence in, and acceptance of, CDSS.
Discussions at the Medical DSS in the 21st Century conference in Sydney,
October 2003, also indicated that clinicians were thinking of DSS as a tool to
assist in the prevention of incorrect medication usage and to provide reminders
for treatment protocols, rather than as a tool for more sophisticated
multi-dimensional analysis, ‘what-if’ scenario analysis and the generation of
multiple options.
Comparing Table 2.1 with the features in Table 2.3, it is possible to identify
the open areas in DSS within a medical domain that have not yet been
widely explored. These include:
o Exploring multiple perspectives of a decision context
o Generating multiple and higher quality alternatives for
consideration
o Exploring and testing multiple problem-solving strategies
o Facilitating brainstorming and other creative problem-solving
techniques
o Exploring multiple scenarios for a given decision context
2.4.3 DxPlain – a medical IDSS from the United States of America.
The DxPlain (DXPlain, 2006) IDSS developed at the Laboratory of
Computer Science at Massachusetts General Hospital uses a modified form
of Bayesian Logic to derive clinical interpretations. DxPlain accepts a set of
clinical findings including signs, symptoms and laboratory data for a specific
patient. DxPlain then produces a ranked list of diagnoses that may explain or
be associated with the clinical manifestations. Of particular importance,
given the comments regarding Features 10 and 11 in the work of Kawamoto
et al. (2005) above, DxPlain provides justification for why each of the
diseases might be considered and suggests what additional clinical
information could be useful to collect. Unusual or atypical clinical
manifestations are also listed for specific diseases.
The knowledge base (KB) component of the DxPlain IDSS includes over
2200 diseases and in excess of 4900 clinical findings, i.e. symptoms, signs,
epidemiologic data, laboratory, endoscopic and radiologic findings. The
average disease description includes 53 findings. Each disease/finding pair
has two numbers describing the relationship. The first indicates the
frequency with which the finding occurs in the disease and the other the
degree to which the presence of the finding suggests possibility of the
disease. There are more than 230,000 individual data points in the
knowledge base depicting disease/finding relationships. An additional value
in the range 1-5 indicates how important it is to explain the presence of each
finding. This value is independent of any particular disease.
Each disease has two associated values: one indicates its prevalence as very
common, common, rare or very rare. The other value indicates the
importance, ranked between 1 and 5, and attempts to reflect the impact of
not considering the disease if it is present.
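The knowledge base structure described above can be sketched as a simple data model. This is an illustration only: the attribute names (including 'evoking strength' for the second number of each disease/finding pair), the numeric scales of the pair values, every entry and the toy `rank` function are assumptions made here. The source describes DxPlain only as using a modified form of Bayesian logic, and the scoring below does not reproduce it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Disease:
    name: str
    prevalence: str   # "very common", "common", "rare" or "very rare"
    importance: int   # 1-5: impact of not considering the disease if present
    # finding -> (frequency of the finding in the disease,
    #             degree to which the finding suggests the disease)
    findings: Dict[str, Tuple[int, int]] = field(default_factory=dict)


# finding -> 1-5 importance of explaining it, independent of any disease
FINDING_IMPORTANCE = {"finding_x": 4, "finding_y": 2, "finding_z": 5}

# Placeholder entries only; these are not clinical data.
KB = [
    Disease("disease_1", "common", 3, {"finding_x": (4, 3), "finding_y": (2, 1)}),
    Disease("disease_2", "rare", 5, {"finding_x": (1, 1), "finding_z": (5, 4)}),
]


def rank(kb: List[Disease], observed: Set[str]) -> List[Tuple[str, int, List[str]]]:
    """Toy ranking: sum of evoking strength weighted by finding importance."""
    scored = []
    for disease in kb:
        matched = disease.findings.keys() & observed
        score = sum(disease.findings[f][1] * FINDING_IMPORTANCE[f] for f in matched)
        if score:
            # The matched findings serve as a simple justification for the ranking.
            scored.append((disease.name, score, sorted(matched)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank(KB, {"finding_x", "finding_z"})
print(ranking)
```

Returning the matched findings alongside each score mirrors, in a very reduced form, the justification that the text identifies as important for clinician confidence.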
DxPlain grew from a stand-alone application with a knowledge base of 500
diseases into a web-based version which can be subscribed to by Hospitals,
Medical Schools and Healthcare organisations. Many years' work has been
involved in the establishment of this DSS. An extended CRISP-DM applied
over historical patient data would allow for rapid generation of disease,
treatment and clinical findings from SWSAHS fetal-maternal data sets to
populate a similar knowledge base, using the innovative methodology
presented in this thesis. DxPlain is considered here because it contains a
large dataset which has been refined over many years by clinicians. DxPlain
presents as a good example of a medical IDSS which has been persistently
used by an ever widening group of clinicians.
2.4.4 Medical IDSS in the Australian Context
In the Australian context, the Australian Health Information Council (AHIC)
provides advice to Health Ministers on how information management and
information and communication technology (IM&ICT) effort can be
harnessed to address current and emerging needs in health care delivery,
management and planning. This council recognised the need to research the
use of DSS in the Australian Health sector and formed the Electronic
Decision Support (EDSS) sub group. This group defines Electronic Decision
Support as:
Electronic Decision Support is access to knowledge stored
electronically to aid patients, carers, and service providers in
making decisions on health care.
In November 2002 the findings of their research were published in a national
report, Electronic Decision Support for Australia’s Health Sector (Australian
Health Information Council, 2003), and the Electronic Decision Support
Steering committee manages and gives direction to the recommendations
made in this report.
This report considers the following:
o A definition of electronic decision support
o Status of electronic decision support system implementation
o Evidence for benefits of using electronic decision support
systems
o The needs of clinicians and health information industry
o Barriers to successful implementation of EDSS
Of particular interest to my research are the barriers to successful
implementation of EDSS. These barriers are generalised and listed as:
1. Concerns about quality and safety aspects of the
systems
2. Gaining the acceptance of health professionals
3. Implementation issues
4. Level of investment required
From barrier (1) it is clear that one of the main areas of concern is the
content of the underlying knowledge base used in the EDSS. Specific
concerns are whether knowledge bases have been translated
accurately into electronic form, whether they are based on medical evidence,
whether they are peer reviewed and whether there have been trials to test the
‘rules’.
Barrier (2) again raises confidence in the knowledge base as a possible
barrier to clinician acceptance of EDSS. To quote the report:
They (clinicians) expect that the knowledge within such
systems must match that of the most trusted experts within each
area of clinical practice. They require accurate translation of the
knowledge base into an electronic format. Clinicians argue that
many systems are limited because of the quality of the data
entered and the failure to reflect local patient mix and practice
patterns. (Australian Health Information Council, 2003)
Discussions with my research partners at the Liverpool fetal-maternal unit
have re-iterated these particular concerns. Elicitation of domain knowledge
for inclusion in a knowledge base is an open research area. Inference of
fetal-maternal rules from the extensive data set held at Liverpool helps to
address the problem highlighted above regarding failure to reflect local
patient mix. It is interesting to note that in a strict scientific research
approach the limitation of inferred rules to only a subset of the broader
population could be considered a weakness. Clinicians, however, are
calling for a local focus, as described in the Australian national report above.
All of these barriers influence the conduct of my research; however, barriers
(1) and (2) have been given particular consideration, and the proposed
methodology described in Chapter 4 of this thesis is motivated by a desire to
mitigate clinicians' concerns about quality and safety and to improve overall
acceptance by users of the fetal-maternal IDSS.
There is also a need for guidelines to assist novices in evaluating EDSS.
These guidelines have been produced by The Centre for Health Informatics
Research at the University of New South Wales (Australian Health Information
Council, 2003). They describe the knowledge content of an EDSS as
comprising two components, the Inference Engine and the Knowledge Base, as
illustrated below and referred to previously in this thesis in section 2.3.2.4.
Figure 2.10: A general model of an EDSS
The guidelines go on to explain that the content of the knowledge base is
a specific representation of the knowledge and recommendations in a
particular clinical area, e.g. recommending treatment for thyroid disorder or a
type of cancer. Medical treatment protocols/guidelines are
considered to be necessary inclusions in the knowledge base. The knowledge
base contents are generally unreadable by humans and stored in a format
understood by the inference engine. The inference engine draws logical
conclusions using a particular method of reasoning. These guidelines also
emphasise the importance of using a reliable and valid knowledge source to
ensure compliance with evidence-based best practice, leading to improved
quality, safety and consistency of care.
These EDSS guidelines, published by the UNSW Centre for Health Informatics
Research, emphasise that conversion of the knowledge source into executable
content should be appropriately supervised. This is to ensure that no medical
domain knowledge is lost or accidentally changed in any way. When
multiple knowledge sources have been merged into a single knowledge
source, i.e. the knowledge base within the EDSS, there is a possibility that
the different sources will make conflicting recommendations regarding
patient treatments/protocols. Carefully considered procedures must be in
place during conversion of knowledge source into executable content to
minimize the impact of the adverse factors described above.
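One procedure of the kind called for above is to detect conflicting recommendations at the moment knowledge sources are merged, and to hold them back for clinical review rather than resolving them automatically. The sketch below is a hypothetical illustration, not a method taken from the UNSW guidelines; the `Recommendation` type, the `merge_sources` function and all entries are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Recommendation(NamedTuple):
    source: str      # knowledge source the recommendation came from
    condition: str   # clinical situation it applies to
    action: str      # recommended treatment or protocol


def merge_sources(
    recs: Iterable[Recommendation],
) -> Tuple[Dict[str, str], Dict[str, List[Recommendation]]]:
    """Merge recommendations by condition; conditions on which the sources
    disagree are returned separately so that they can be reviewed."""
    by_condition: Dict[str, List[Recommendation]] = defaultdict(list)
    for rec in recs:
        by_condition[rec.condition].append(rec)

    merged: Dict[str, str] = {}
    conflicts: Dict[str, List[Recommendation]] = {}
    for condition, group in by_condition.items():
        if len({rec.action for rec in group}) == 1:
            merged[condition] = group[0].action
        else:
            conflicts[condition] = group  # keep provenance for the audit trail
    return merged, conflicts


# Placeholder entries only; these are not clinical recommendations.
recs = [
    Recommendation("source_1", "condition_x", "protocol_1"),
    Recommendation("source_2", "condition_x", "protocol_2"),
    Recommendation("source_1", "condition_y", "protocol_3"),
    Recommendation("source_2", "condition_y", "protocol_3"),
]
merged, conflicts = merge_sources(recs)
print(merged)
print(conflicts)
```

Keeping the originating source with each conflicting recommendation supports the documented, auditable conversion process that the guidelines encourage.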
Conversion of the medical knowledge source into an executable content, for
use in the inference engine, should conform to a standardised methodology.
These UNSW researchers acknowledge that there are few applicable
standards in the area and encourage the use of a documented and auditable
methodology.
Establishment of such a standard methodology is clearly an important open
research area.
My thesis pursues the development of an evidence-based process for
deriving knowledge source rules from the fetal-maternal domain for use by
the inference engine in the knowledge base component of the proposed IDSS
for fetal-maternal clinical practice and research. Chapter 4 of this thesis
describes my proposed methodology, which answers the UNSW researchers'
call for adequate clinical input and review and use of a documented,
auditable methodology during knowledge conversion into executable format.
2.4.5 Conclusions and Implications for this Research
The review conducted by Kawamoto et al. (2005) indicates that inclusion of IDSS
within clinical workflows is important for efficiency and system acceptance.
IDSS such as DxPlain are not integrated within the clinician’s workflow;
they are separate reference-type systems that are not pervasive and
immediately available for use with the protocol monitoring and schedule
monitoring favoured in Kawamoto’s features. Integrating these currently
disparate components of IDSS is an open research area. The IDSS
framework described in this research provides an opportunity to improve this
situation by integrating patient data from online transaction processing
systems with a knowledge base (similar concept to DxPlain) to facilitate
clinical research and practice.
2.5 EXPERIMENT DESIGN AND CLINICAL TRIALS.
Gross-Portney and Watkins (2000) describe the ultimate purpose of health
professionals to be the development of a knowledge base that will maximise
the effectiveness of health practice. Evidence-based medicine is
a fundamental principle underlying health care which aims to ensure that
choices regarding patients are made based on evidence that has been
confirmed by sound scientific data. The British Medical Journal (Sackett,
Rosenberg, Muir Gray, Haynes & Scott-Richardson, 1996) defines evidence-based medicine as: The conscientious, explicit and judicious use of current
best evidence in making decisions about the care of individual patients.
Efforts to establish medical evidence include Clinical Research, defined by
Gross-Portney and Watkins (2000):
Clinical Research is a structured process of investigating facts
and theories and exploring connections … examining clinical
conditions and outcomes to establish relationships among
clinical phenomena, to generate evidence for decision making
and to provide the impetus for improving methods of practice.
The Scientific Method guides Clinical Research and is illustrated by
Anderson (2006) in the following schematic:
Figure 2.11: The Scientific Method (Anderson, 2006)
The principle of the scientific method involves a cycle of observation-hypothesis-experiment. Carey (1994) offers a brief description:
In a nutshell, then, we can say that scientific method is a
rigorous process whereby new ideas about how some part of the
natural world works are put to the test (Carey, 1994)
The following is a brief description of the Scientific Method bringing
together concepts and explanations from a variety of sources (Anderson,
2006; Carey, 1994; Corsi & Weindling, 1983). I acknowledge that debate
continues in the scientific community regarding the scientific method,
philosophy, theology and sociology. I also acknowledge that there is not one
step-by-step recipe (Carey, 1994) followed by scientists in their research.
However, Figure 2.11 does represent a broadly held positivist view of the
scientific method which is appropriate for this research.
The scientific method begins with observations of phenomena. To
investigate further a testable hypothesis is developed in an attempt to
explain the observations. The veracity of this hypothesis is tested via further
experimentation or observations. If the hypothesis cannot be verified to an
acceptable level of ‘certainty’ an iterative action follows with a refinement
of the hypothesis and retesting. If many tests of the hypothesis are
repeatable, and found to support the hypothesis, the importance of the work
can be raised to the level of a widely held theory. The potential exists for the
theory to then face scrutiny and progress to the status of a ‘law’.
The purpose of experimental design is to provide a mechanism for assessing
the cause-and-effect relationship between a set of dependent and
independent variables. Experimental design can be depicted as existing on a
continuum as illustrated below:
[Figure: a continuum of experiment design running from DESCRIPTIVE (Describe Populations) through EXPLORATORY (Find Relationships) to EXPERIMENTAL (Cause and Effect).]
Figure 2.12: Continuum of Experiment Design
(Gross-Portney & Watkins, 2000)
A clinical trial is the most common type of experimental design in
epidemiology and provides the strongest evidence for cause and effect.
Clinical trials are prospective studies comparing the effect of an intervention
against a control. Piantadosi (1997) defines a clinical trial as an experiment
testing medical treatments on human subjects.
Randomized clinical trials are regarded as the ideal experiment design. Gross-Portney and Watkins (2000) describe clinical research experimental designs
including the Non-equivalent post test-only control group design. This is a
quasi-experimental design that uses existing patient groups. Investigative
groups can be drawn from existing clinical patient data. This experiment
design is particularly useful when ethical concerns preclude a true control
group or when investigating a rare condition or event. Patients treated in a
fetal-maternal domain generally present with unusual conditions not often
found in the broader patient group. Establishment of a clinical trial
surrounding these unusual cases may not be possible – even if a multi-centre
approach is adopted.
The difficulty in using the randomized method for rare conditions relates to
the need for a large enough ‘n’ to provide the power to find statistically
significant differences between study groups. This is not possible if there are
few cases to include in a study.
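The link between the size of ‘n’ and statistical power can be made concrete with a small calculation. The following is a minimal sketch only, assuming a two-group comparison of outcome proportions and the standard normal-approximation formula for sample size; the function name and the outcome rates are hypothetical and are not drawn from any study cited in this thesis.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients needed per group to detect p1 vs p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    p_bar = (p1 + p2) / 2                          # pooled proportion under the null hypothesis
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Hypothetical outcome rates: 10% in the control group, 20% in the study group.
print(n_per_group(0.10, 0.20))
# A smaller difference between the groups requires more patients per group.
print(n_per_group(0.10, 0.15))
```

Under these assumptions the required group size grows quickly as the difference to be detected shrinks, which is why a rare condition with few available cases may not support a randomized design.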
The research presented by Lord, Genski and Keech (2004) discusses the
issues of using clinical trial data to support other research investigations. As
described above, the establishment of a clinical trial is not always possible.
An emerging open research area involves addressing the challenge of
conducting meaningful, sound data analysis on existing data collected during
patient treatment. The research presented by Lord et al. (2004) does not
consider the foundation scientific method but rather assumes that clinical
trials are the only vehicle for such data analysis.
Gross-Portney and Watkins (2000) state the Historical Research approach to
research is weak:
Because sources, measurements and organisation of data are not
controlled, cause-and-effect statements cannot be made in
historical research. (Gross-Portney & Watkins, 2000)
The application of data mining techniques, in the analysis of the historical
patient data, may provide an opportunity to move this experimental design
from quasi status towards a stronger cause and effect experiment design.
Some research has combined data mining with retrospective studies of
patient data to search for prognostic markers, including Ji, Naguib and Ghoneim
(2003), Goodwin et al. (2000; 2001) and Hagland (2004), who reports the
work of Dr Eric Bremer in Pediatric Brain Tumor research. The extensions
to CRISP-DM, as described in Chapter 4, facilitate such an innovative
approach.
It should be noted that the use of IDSS and KDD in healthcare organisations
has largely focussed on organisational issues (Alexandrini et al., 2003; Rao
et al., 2003) rather than exploration of patient historical data records. I see
this as another example of IDSS and KDD being used to improve the
commercial position of these organisations rather than advancing medical
knowledge.
2.5.1 Dominant Medical Research Paradigm
The Null hypothesis – from the Latin nullus, meaning ‘not any’, states that
there is no difference between the comparison groups and is usually
established to be disproved rather than proved. In a clinical setting the null
hypothesis expects no clinical effect or difference beyond chance difference
(Graziano & Raulin, 2003; Piantadosi, 1997). If a statistically significant
difference is found then the null hypothesis must be rejected. If the
differences are within chance limits, the null hypothesis is NOT rejected.
(Graziano & Raulin, 2003)
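The reject / do-not-reject decision described above can be illustrated with a small worked test. This is a sketch under stated assumptions: it uses a two-sided Fisher exact test on a 2x2 table (one common choice for small samples, not a method prescribed by the sources cited here), a significance level of 0.05, and invented counts.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x events in row 1 with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table at least as unlikely as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


# Hypothetical counts: rows are study group / comparison group,
# columns are outcome present / outcome absent.
p_value = fisher_exact_two_sided(9, 1, 3, 7)
alpha = 0.05  # assumed significance level
print(f"p = {p_value:.4f}")
print("reject the null hypothesis" if p_value < alpha
      else "do NOT reject the null hypothesis")
```

A small p-value only indicates that the observed difference is unlikely under the null hypothesis; it does not by itself establish cause and effect.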
Rejecting the null hypothesis is not sufficient to draw a causal inference
between variables, as factors other than the independent variable may impact
on the dependent variable under investigation. Such factors are known as
confounding variables. Roddick (2003) refers to the impact the null hypothesis
paradigm has on the typical processes adopted in knowledge discovery in
data (KDD) and indicates this is an open area for research. The enhancement
of CRISP-DM, respecting the needs of evidence-based medicine and the strength of
the null hypothesis paradigm, underpins the new process described in Chapter 4.
2.5.2 Clinical Reasoning, Statistical Reasoning and KDD
Clinical and statistical reasoning converge in clinical research (Piantadosi,
1997) and, interestingly, Mathews (1995) describes resistance to the use of
numerical comparison and statistical methods in evaluating therapeutic
efficacy. In clinical research, empirical knowledge comes from observations
and data and theory based knowledge comes from established biology or
hypothesis. Similarly, statistics uses empirical knowledge drawn from data
or observations and theory based knowledge of probability and determinism
which has been formalized in mathematical models (Piantadosi, 1997). Well
established statistical techniques exploited in clinical research include
regression, generalized linear models, regression trees, time series and
analysis of variance (Han & Kamber, 2001).
Figure 2.13: Convergence of Clinical and Statistical Reasoning Paradigm
The KDD process, including the data mining sub-process, has been widely
researched and documented (Han & Kamber, 2001; Inmon, 2002;
Mackinnon & Glick 1999; Marakas, 2002b; Masuda & Sakamoto, 2002;
McCarthy, 2000; Roddick et al., 2003; Roiger & Geatz, 2003). Extending
the convergence of clinical research and statistics paradigm described above,
KDD is an additional, converging dimension that will increase the power of
clinical research that exploits historical data. This convergence is introduced
and explored throughout this thesis. This is an open research area that
remains unaddressed by other researchers.
Figure 2.14: Addition of KDD to Reasoning Paradigm
2.5.3 Conclusions and Implications for Research
KDD and data mining enhance the traditional statistical techniques by
bringing theoretical foundations and techniques of data mining including
machine learning, neural networks, association mining and clustering.
Application of these principles to Gross-Portney and Watkins’ (2000) Non-equivalent post test-only control group experiment design has not been
widely documented or researched and presents an open research area.
In addition, modification of established KDD methodologies to support null
hypothesis in clinical research is an open KDD research area that is
addressed in this thesis via the extended CRISP-DM methodology to be
employed in the data management component of IDSS.
CHAPTER 3 CROSS INDUSTRY STANDARD PROCESS – DATA
MINING (CRISP-DM) SPECIALIZED TASKS FOR MEDICAL
RESEARCH.
3.1 Introduction
This Chapter addresses Hypothesis 1 of this thesis by demonstrating that the
CRISP-DM can be extended to enable its use in medical research driven by
the null hypothesis paradigm.
Clinical trials are the currently favoured approach to sound medical research.
These trials involve control groups and intervention groups, and all
associated factors are strictly controlled and monitored. Clinical trials
are formulated and conducted in such a way as to reject a particular null
hypothesis, thus supporting a particular alternate hypothesis. This design is
considered to be a strong experimental design from which cause-and-effect
relationships between factors can be stated.
The patients treated in a fetal-maternal environment present with multiple
complex, rare conditions. The construction of a rigorous clinical trial – even
a multi-centre clinical trial – is more difficult to organise than say a clinical
trial for study of lung cancer patients, due to the small number of patients
presenting with these fetal-maternal conditions.
This leads us to the interpretation of existing clinical data to support hypotheses
held in the fetal-maternal environment. Retrospective, historical studies have
traditionally not held up to the rigorous analysis possible through a
prospective, clinical trial. Such studies are considered quasi-experimental
designs.
This thesis proposes that by adopting an improved CRISP-DM that
specifically addresses the requirements of null hypothesis analysis, it is
possible for a retrospective, historical analysis to move from the arena of
quasi-experimental design towards the well regarded true experimental
design.
The knowledge thus generated, in an electronic format, can be used to
populate the knowledge component in an IDSS when a rule generating data
mining technique is chosen in the modelling phase of CRISP-DM.
3.2 Extended CRISP-DM enhancing Data Management Layer of IDSS
This thesis proposes extensions to the CRISP-DM. These new concepts are
to be applied within the Data Management Layer of the proposed IDSS
architecture, as highlighted below.
[Figure: IDSS framework. Zones: ETL Frameworks; Data Warehouse Architectures; Data Mining / Knowledge Discovery in Data Frameworks; Knowledge Base Architecture; Model Base Architecture. Components: three Data Sources; Extract Transform Load (ETL) Process; Data Warehouse (DW) / Mart; Domain Ontology; IDSS Data Management; Knowledge Base; Model Base.]
Figure 3.1: Framework for IDSS, data management highlight
3.3 ‘Outputs’ from Data Mining are ‘Inputs’ to the Knowledge Base of
IDSS
The domain knowledge generated by KDD activities is applied in the
Knowledge Base component of the IDSS Architecture, as highlighted below.
[Figure: IDSS framework. Zones: ETL Frameworks; Data Warehouse Architectures; Data Mining / Knowledge Discovery in Data Frameworks; Knowledge Base Architecture; Model Base Architecture. Components: three Data Sources; Extract Transform Load (ETL) Process; Data Warehouse (DW) / Mart; Domain Ontology; IDSS Data Management; Knowledge Base; Model Base.]
Figure 3.2: Framework for IDSS, knowledge base highlight
3.4 CRISP-DM Specialised Tasks to Support Medical Research
The Literature review section of this thesis introduced both the Scientific
Method and the CRISP-DM. I propose in this thesis that the Scientific
Method’s principle of observation-hypothesis-experiment rests comfortably
with the CRISP-DM, with discovery data mining supporting the observation-hypothesis phase and confirmatory data mining supporting the hypothesis-experiment phase. The following diagram illustrates my demonstration of the
parallelism between the Scientific Method and CRISP-DM, which is very
applicable in the fetal-maternal (and other) medical domain. Other
researchers have not explored such parallelism previously and this is a
unique aspect of this thesis. The switch from exploratory data
mining to confirmatory data mining is explored in the remainder of this
section.
Figure 3.3: Parallelism between CRISP-DM and the Scientific Method
The diagram above illustrates the convergence I perceive in the various
concepts of CRISP-DM, exploratory and confirmatory data mining and the
scientific method. This comparison has not been drawn in existing research.
I have grouped the CRISP-DM and scientific method elements into five
common phases, indicated the iterative nature of each and exposed the
scope for use of exploratory and confirmatory data mining.
The null hypothesis medical research paradigm, as discussed in the literature
review, must be the driver behind the confirmatory data mining and
associated analysis for the scientific method.
3.4.1 Initial Phase
The initial phase of CRISP-DM involves developing an
understanding of the business/organisational operation for the domain to be
investigated. Once an overall appreciation of the business/organisation is
established a more detailed analysis of the organisational data is undertaken.
There is no set timeframe specified for the initial phase of the CRISP-DM.
Organisational staff may, over a couple of years, months or weeks, observe
trends or exceptional circumstances in business operation that generate
particular datasets. Alternatively data analysts may be invited to use
exploratory data mining techniques to quickly establish some overall
business understanding and more detailed data understanding. Both of these
approaches identify aspects of organisation operation, from production to
sales, which could benefit from further analysis.
The initial phase of the scientific method begins with an observation of
phenomena. Similar to the initial phase for CRISP-DM there is no
established timeframe for this initial phase. Clinicians may over a period of
years, months or weeks observe trends or exceptional circumstances
regarding patients that generate particular datasets. From these observations
a hypothesis may be formulated to explain the observations. Inviting data
analysts to use exploratory data mining techniques to establish some overall
appreciation of computerised patient data is, as yet, not a widely accepted
element of the initial phase of the scientific method. As noted in the
literature review of this thesis the nature of medical data is challenging due
to the demands of: working with a considerable knowledge base; data
availability and clinical data accuracy problems. It is in this initial phase and
the following data preparation phase that such issues must be dealt with in
using the scientific method for clinical research.
It is in this initial phase that the clinicians formulate an alternate hypothesis
and generate an associated null hypothesis which will be used in the testing
phase, once the intermediate data preparation phase is complete.
Exploratory data mining takes many variables/attributes/factors into
consideration using a variety of techniques in search for systematic patterns.
The results and outcomes from the exploratory data mining are weak until
they are confirmed in the confirmatory data mining phase of the
methodology illustrated in Figure 3.3 above. Exploratory data mining
approaches provide essential information for consideration in the initial and
data preparation phases for both the CRISP-DM and scientific method.
Exploratory data mining techniques are not used for null hypothesis testing –
this is done using confirmatory data mining techniques in the testing phase.
3.4.2 Data Preparation Phase
The data preparation phase of CRISP-DM aims to prepare the datasets to be
used for modelling or the major analysis of the project. Attributes and rows
are selected from source database management systems. Not all data
contained within a business system are required – the scope is determined by
the business phenomena to be investigated. As noted in the literature review
of this thesis data preparation is an expensive time consuming aspect of
KDD and targeting particular data elements increases the efficiency of this
phase of CRISP-DM. The data quality is raised to the level required by the
analysis techniques to be used in the modelling within the following testing
phase. This may involve cleaning of data, inserting suitable default values or
estimating of missing data, transforming data, integrating data and syntactic
modifications.
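The preparation steps listed above (cleaning, estimating missing values and transforming data) can be sketched in a few lines. This is an illustration only: the field names, the plausible age range, the use of the median as the estimate and the sample rows are all assumptions made for the example, not part of CRISP-DM or of any clinical system described in this thesis.

```python
from statistics import median

# Hypothetical raw rows; None marks a missing value and units are mixed.
raw_rows = [
    {"patient_id": 1, "maternal_age": 31, "weight": 68.0, "weight_unit": "kg"},
    {"patient_id": 2, "maternal_age": None, "weight": 150.0, "weight_unit": "lb"},
    {"patient_id": 3, "maternal_age": 27, "weight": None, "weight_unit": "kg"},
    {"patient_id": 4, "maternal_age": 430, "weight": 72.5, "weight_unit": "kg"},  # keying error
]

LB_TO_KG = 0.45359237


def prepare(rows):
    """Return cleaned copies of rows: one unit, invalid values blanked, gaps estimated."""
    cleaned = []
    for row in rows:
        new = dict(row)  # copy so the source rows stay untouched
        # Transform: express every weight in kilograms.
        if new["weight"] is not None and new["weight_unit"] == "lb":
            new["weight"] = round(new["weight"] * LB_TO_KG, 1)
        new["weight_unit"] = "kg"
        # Clean: treat clearly invalid ages as missing (assumed plausible range 10 to 60).
        if new["maternal_age"] is not None and not 10 <= new["maternal_age"] <= 60:
            new["maternal_age"] = None
        cleaned.append(new)
    # Estimate missing values with the median of the observed values.
    for field in ("maternal_age", "weight"):
        fill = median(r[field] for r in cleaned if r[field] is not None)
        for r in cleaned:
            if r[field] is None:
                r[field] = fill
    return cleaned


prepared = prepare(raw_rows)
for row in prepared:
    print(row)
```

In practice the choice of estimate for missing values, and the decision about which values are invalid, would need clinician input rather than a fixed rule.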
The data preparation phase for the scientific method varies depending on the
type of study to be undertaken, viz. randomised, prospective, correlation or
retrospective. This thesis focuses on using existing patient data; therefore the
activities in the data preparation phase for correlation and retrospective
studies are very similar to those described above for CRISP-DM data
preparation. However, it is noted that ethical issues relating to patient
confidentiality and privacy add a level of complexity to the data preparation
phase for scientific method application in a medical domain.
3.4.3 Testing Phase
The testing phase for CRISP-DM involves modelling using a specific
modelling technique such as C4.5, neural network creation or decision tree
building. These models are not generated in an ad-hoc manner; they are
selected and applied to the prepared datasets in response to the targeted area
of business operation identified in the initial phase. The model’s quality and
validity are tested prior to application across the broader dataset. The
prepared organisational data is split into a training set and a test set. The model
is built using the training set and the quality of the model is estimated using
the test set.
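The train/test procedure described above can be sketched as follows. This is a minimal illustration, not C4.5 or any Clementine function: the model is a single-threshold rule (a "decision stump") chosen purely to keep the example short, and the data, the 30% test fraction and the random seed are assumptions.

```python
import random


def split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_stump(train):
    """Pick the threshold t that best fits the rule 'value > t => True' on train."""
    best_t, best_correct = None, -1
    for t in sorted({value for value, _ in train}):
        correct = sum((value > t) == label for value, label in train)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def accuracy(threshold, rows):
    """Fraction of rows whose label matches the rule 'value > threshold'."""
    return sum((value > threshold) == label for value, label in rows) / len(rows)


# Hypothetical synthetic data: (attribute value, outcome); outcome is True above 35.
data = [(v, v > 35) for v in range(18, 48)]
train, test = split(data)
threshold = fit_stump(train)  # the model is built using the training set only
print(f"threshold = {threshold}")
print(f"test accuracy = {accuracy(threshold, test):.2f}")  # quality estimated on unseen rows
```

Keeping the test rows out of model building is what allows the accuracy figure to serve as an estimate of quality on the broader dataset.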
In the scientific domain traditional hypothesis testing aims to verify a priori
hypotheses about relations between variables/attributes/factors. An example
of such a hypothesis may be “There is a positive correlation between
maternal age and occurrence of Down syndrome in newborns”. Exploratory
data mining is used to identify relationships between
variables/attributes/factors when there are incomplete or non-existent a
priori expectations. This is conducted in the initial phase of the methodology
presented in Figure 3.3. Ideally the exploratory data mining results should be
cross validated using a different data set or an independent subset of the
original patient data drawn from the data warehouse. This is done in the
testing phase. In addition to testing the null hypothesis it is also important to
test the predictive validity of any association rules identified in the
exploratory data mining phase.
3.4.4 Assessment Phase
Figure 3.3 has the traditional CRISP-DM evaluation step contained in the
assessment phase. CRISP-DM evaluation involves assessment of data
mining results checking for pertinence to initial business KDD objectives.
CRISP-DM calls for an evaluation of the DM outputs to determine the next
steps: whether the process returns to the initial phase and amends the KDD scope for
further iterations, or proceeds to deployment within the usage phase.
This is the crucial phase for assessing the strength of the evidence against the null hypothesis.
The CRISP-DM evaluation step must be expanded in this assessment phase
for medical domain applications. The CRISP-DM guidelines are, as
intended, generic in nature with usefulness across most domains. For use in
the fetal-maternal domain, and medical domain more broadly, this
assessment phase must be strengthened with the traditional null hypothesis
evaluation regarding statistical significance and data bias. Confirmatory data
mining techniques and traditional statistical techniques support this
assessment phase for the scientific method. If the null hypothesis cannot be
rejected then the alternate hypothesis cannot be supported and, as
indicated in Figure 3.3, a return to hypothesis formulation is required.
This mirrors the evaluation of data mining models in CRISP-DM forcing a
return to the initial phase if needed.
3.4.5 Usage Phase
CRISP-DM deployment involves using positive evaluation results and
deploying, monitoring and maintaining data mining results in the business
workplace.
The usage phase for the medical domain is more complex as the demands of
evidence based medicine practice require ratification of results prior to use
in clinical practice. It is the demands of this ratification – starting with the
investigative method – that acted as a catalyst for this research.
3.4.6 The role of exploratory and confirmatory data mining
The tools used for exploratory data mining can also be used in the
confirmatory data mining phase; however, the manner in which they are
employed is quite different. For example, the C5.0 algorithm can be used in
the exploratory phase and in the confirmatory phase. In the exploratory
phase the intuition and experience of the clinician guides the selection of
attributes from the data warehouse for use in the data management layer
where the C5.0 is executed. The clinician has insight into the patient data set
and thus can reduce thousands of possible attributes for use in the C5.0,
down to a more manageable quantity. Recall that one of the characteristics
of medical patient data is the large number of dimensions, i.e. potential
attributes/factors for consideration. Some researchers interested in the
scientific methodology may say that the use of clinicians to ‘whittle’ down
the factors to be considered by the C5.0 is an abuse of the method. Other
researchers believe that close co-operation between the clinician and data
analyst facilitates efficient CRISP-DM operation. Despite the validity of
both these opinions I believe that it is necessary for the data analyst to at
least confer with the clinician regarding factors to target in initial
exploratory investigations. In an ideal world, with unlimited resources, we
could run the C5.0 across all permutations and combinations of data
attributes/factors; however, in reality this opportunity is unlikely to present.
A generic example is created here to illustrate.
A clinical data warehouse may contain 1000 attributes, A1, A2, A3 … A1000. An
initial clinician review of these attributes may highlight a subset of attributes
to be investigated using exploratory data mining techniques; this subset is
B1, B2, B3 … B200. Identification of patterns, clusters and previously unrealised relationships between 200 factors is a computational task beyond
human ability, therefore, the use of exploratory data mining techniques
remains valid. Continuing with the example, currently available data mining
tools, such as Clementine 8.0, facilitate the import of the 200 factors from
the data warehouse.
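To give a sense of scale for the generic example above, the number of candidate relationships among 200 factors can be counted directly. The sketch below is simple arithmetic only; it makes no claim about how C5.0 or Clementine search the attribute space.

```python
from math import comb

n_attributes = 200  # the clinician-selected subset B1 ... B200 from the example above

pairs = comb(n_attributes, 2)    # candidate two-factor relationships
triples = comb(n_attributes, 3)  # candidate three-factor relationships
subsets = 2 ** n_attributes      # every possible combination of factors

print(f"pairs: {pairs:,}")
print(f"triples: {triples:,}")
print(f"all subsets: a number with {len(str(subsets))} digits")
```

Even after clinician review has reduced 1000 attributes to 200, exhaustive manual inspection of the combinations is not feasible, which supports the use of exploratory data mining tools at this stage.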
Using the filtering function available in Clementine 8.0 it is possible to
readily ‘explore’ the selected data set. It is important to have a data mining
tool suited to rapid iteration through exploratory data mining processes;
rapid display of results is essential. Consideration of the relative efficiencies
of various data mining algorithms is outside the scope of this thesis.
Through the exploration process the C5.0 algorithm, or other suitable
algorithm of choice, provides valuable insights into the data set. Close cooperation must be maintained between the data analyst and clinicians at this
iterative, exploratory stage of the data mining methodology. Clinicians can
expect to see predictable relationships identified by the data mining
algorithm. Continuing with the generic example the C5.0 algorithm may
return the following type of output:
…
B5 = X [Mode:J] (500)
B78 = Y[Mode:K] (300)
B196 > 25 (210)
B196 <= 39[Mode :J] => P (90, 0.634)
B196 > 39 [Mode : J] (100)
Visualisation is another data mining tool that is useful during the exploratory
data mining phase of the new methodology. The output from the C5.0
algorithm is not readily interpreted by the clinicians. The advantages of data
visualisation are well documented and this tool is well suited to exploring
large patient data sets.
If the methodology is being effectively utilized, during the exploratory data
mining phase the clinicians should formulate some hypotheses that they wish
to consider more closely. This is the stage when focus shifts from
exploratory data mining to confirmatory data mining, as illustrated in Figure
3.3 above; hypotheses are to be tested with confirmatory techniques. As
previously stated the data mining tools used in exploratory work may also be
used in the confirmatory work – however the manner in which they are
employed is different. In the free-ranging exploratory phase the C5.0 was
used to randomly draw together various factors for consideration. The start
of the confirmatory phase occurs when clinicians formulate an alternative
hypothesis and move to a null hypothesis, as discussed in the literature
review of this thesis, refer section 2.6.1. With the clear statement of a null
hypothesis we must start with a fresh set of factors for consideration in the
C5.0 algorithm. Any factors/attributes not related to the stated null
hypothesis must be removed from consideration. We are left with the
factors/attributes of concern to the null hypothesis.
In this confirmatory phase traditional statistical techniques are combined
with the data mining algorithms to measure the ‘strength’ of the outputs
from the data analysis. It is at this point that we can become engrossed in the
issues surrounding the generation of clean data sets prepared specifically for
analysis purposes. Research abounds regarding the necessary quality
demanded of patient datasets to be used for confirmatory analyses. Section
2.6 of the literature review in this thesis addresses these matters and
identifies this as a potential open research area. The innovative methodology
I present here acknowledges these demands and proposes that by adopting
an improved CRISP-DM that specifically addresses the requirements of null
hypothesis analysis, it is possible for a retrospective, historical analysis to
move from the arena of quasi-experimental design towards the well regarded
true experimental design. The following figure illustrates the additional
layers added to the CRISP-DM to accommodate the needs of Clinical
Practice and Research.
[Figure: the CRISP-DM process with a ‘New Process’ layer of extensions for medical data mining, as proposed by this thesis.]
Figure 3.4:
The Data Mining Rule Set Generation Layer is an exploratory data mining
layer. The rules generated from this layer are reviewed by a suitable
Clinician and Significant Rule Sets are Selected for further analysis. The
Clinician and data analyst work together to Formulate an Appropriate Null
Hypothesis which is carried forward into the confirmatory data mining
processes in the Run Statistical Process to Test Null Hypothesis Layer.
Ideally this would be conducted on a separate subset of confirmatory data to
generate a statistically sound outcome, e.g. meaningful p-values. Following
Clinician review and consideration of the outcome from confirmatory data
mining, appropriate rule sets are loaded into the IDSS Knowledge Base.
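The flow through these layers can be sketched as a short program. This is a hedged illustration under several assumptions: the candidate rule is written by hand in the style of the generic C5.0 output shown earlier, the confirmatory records are invented, the statistical process is a one-degree-of-freedom chi-square test of independence (one possible choice, not one mandated by the methodology), and clinician review is represented by a simple flag.

```python
from math import erfc, sqrt


def chi_square_p(a, b, c, d):
    """p-value of the 1-df chi-square test of independence for [[a, b], [c, d]].

    Assumes every row and column total is non-zero.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with one degree of freedom


def confirm_rule(rule, records, alpha=0.05):
    """Test H0: the outcome is independent of whether the rule's condition holds."""
    a = sum(1 for r in records if rule["condition"](r) and r["outcome"])
    b = sum(1 for r in records if rule["condition"](r) and not r["outcome"])
    c = sum(1 for r in records if not rule["condition"](r) and r["outcome"])
    d = sum(1 for r in records if not rule["condition"](r) and not r["outcome"])
    p = chi_square_p(a, b, c, d)
    return p, p < alpha


# Hypothetical rule selected by the clinician from the exploratory rule set.
rule = {"name": "B196 <= 39 => P", "condition": lambda r: r["B196"] <= 39}

# Hypothetical confirmatory subset, kept separate from the exploratory data.
confirmatory = (
    [{"B196": 30, "outcome": True}] * 30 + [{"B196": 30, "outcome": False}] * 10
    + [{"B196": 45, "outcome": True}] * 12 + [{"B196": 45, "outcome": False}] * 28
)

knowledge_base = []
p_value, rejected = confirm_rule(rule, confirmatory)
clinician_approves = True  # stands in for the clinician review step
if rejected and clinician_approves:
    knowledge_base.append({"rule": rule["name"], "p_value": p_value})

print(f"p = {p_value:.6f}, loaded rules = {len(knowledge_base)}")
```

The design point illustrated is that a rule reaches the knowledge base only when both conditions hold: the null hypothesis is rejected on data not used for exploration, and the clinician accepts the rule.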
3.5 Generation of Electronic ‘Rules’ for use in IDSS knowledge base
Clinicians have absolute control over the knowledge base rules that are
added to the electronic knowledge base. These Clinicians have the improved
retrospective study, including use of null hypothesis, to support the validity
of generated rules.
The elicitation of domain knowledge, in electronic format, for use in
knowledge bases has been investigated by other researchers (Bench-Capon,
Coenen, Nwana, Paton & Shave, 1993; Bench-Capon & Visser, 1997).
These researchers state that if the conceptualisation is explicit the knowledge
engineer has a framework to guide the acquisition of domain knowledge.
The MEKAS knowledge acquisition methodology presented by Bench-Capon et al. (1993) attempted to address the need for a systematic approach
to domain knowledge elicitation by including an early stage where the
domain is conceptualised.
Little research has followed on from this early recognition of a need for
systematic approach to domain knowledge elicitation. Researchers including
Jung and Gudivada (1995) dismiss attempts to directly involve experts and
move on to automated methods which have not yet gained the confidence of
the broad medical community:
Elicitation of medical knowledge required for determining
relationships, directly from human experts, is both time
consuming and unreliable. (Jung & Gudivada, 1995)
The approach recommended in this thesis is an improvement on earlier
elicitation methods because it begins with the accepted medical approach of
stating an alternate hypothesis followed by a null hypothesis and proceeds to
rejection or acceptance of the hypothesis. Ultimately, when appropriate, the
methodology moves to establish electronic rule-sets for clinician review
prior to populating the IDSS knowledge base with the generated rules.
CHAPTER 4 INTELLIGENT DECISION SUPPORT SYSTEMS
(IDSS) FOR CLINICAL PRACTICE AND RESEARCH
4.1 Introduction
The second hypothesis of this thesis is proven in this chapter as an
Intelligent Decision Support System (IDSS) is defined for clinical practice
and research including a data management component to exploit the
extended CRISP-DM methodology.
The following diagram illustrates the framework proposed for the IDSS to
be used in the medical domain. The five zones of interest reflect the
organisation of the literature review section 2.2
[Figure: IDSS framework. Zones: ETL Frameworks; Data Warehouse Architectures; Data Mining / Knowledge Discovery in Data Frameworks; Knowledge Base Architecture; Model Base Architecture. Components: three Data Sources; Extract Transform Load (ETL) Process; Data Warehouse (DW) / Mart; Domain Ontology; IDSS Data Management; Knowledge Base; Model Base.]
4.2 Extraction-Transformation and Loading (ETL) Frameworks
The processes depicted in this framework for ETL are provided by
commercially available software tools and supporting customised enterprise
processes. As stated in section 2.3.2.1, formal foundations do not exist for
conceptual representation of ETL activities. In the medical domain the data
sources range from third-party vendor applications such as General Electric
ViewPoint to ad hoc, disposable spreadsheets.
This IDSS framework proposal suggests using a commercial package for the
ETL software. Functionality required for the ETL software includes:
o Graphical user interface
o Formula parser and condition handling for use with a wide variety
of data types including ASCII, numeric, date fields, logical/binary
o Smart type casting
o Debugger with trace and breakpoint capability
o Explicit rollback and commit transaction handling
o Field mapping interface
o Scheduler to manage ongoing ETL without end user initiation
(ETL Portal, 2006)
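Three of the functions listed above (field mapping, type casting and
explicit rollback and commit handling) can be illustrated with a short
sketch. This is an assumption-laden illustration only, not a substitute for
the commercial ETL package recommended here: the source field names, the
target table `exam` and the helper names are all hypothetical, and an
in-memory SQLite database stands in for the warehouse.

```python
import sqlite3
from datetime import date

# Hypothetical mapping: source field -> (warehouse column, type cast).
FIELD_MAP = {
    "PatID": ("patient_id", int),
    "ExamDate": ("exam_date", date.fromisoformat),
    "Weeks": ("gestation_weeks", int),
}


def transform(source_row):
    """Rename and type-cast one source row according to FIELD_MAP."""
    return {target: cast(source_row[source])
            for source, (target, cast) in FIELD_MAP.items()}


def load_batch(conn, source_rows):
    """Load all rows in one transaction; roll back the whole batch on error."""
    try:
        for row in source_rows:
            clean = transform(row)
            conn.execute(
                "INSERT INTO exam VALUES (?, ?, ?)",
                (clean["patient_id"], clean["exam_date"].isoformat(),
                 clean["gestation_weeks"]),
            )
        conn.commit()
        return True
    except (KeyError, ValueError):
        # A missing field or failed cast discards every row of this batch.
        conn.rollback()
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exam (patient_id INTEGER, exam_date TEXT, gestation_weeks INTEGER)"
)
conn.commit()
ok = load_batch(conn, [{"PatID": "7", "ExamDate": "2005-03-01", "Weeks": "12"}])
bad = load_batch(conn, [
    {"PatID": "8", "ExamDate": "2005-03-02", "Weeks": "13"},
    {"PatID": "9", "ExamDate": "not a date", "Weeks": "14"},
])
row_count = conn.execute("SELECT COUNT(*) FROM exam").fetchone()[0]
conn.close()
```

In this sketch the second batch fails on its second row, so the rollback
also discards its first row and the table retains only the row from the
first batch.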
This proposed IDSS framework recommends development of data extraction
scripts for use with the more recent OLTP database systems currently in use
across all medical units. These OLTP databases contain physiological and
demographic datasets. In addition treatment and procedure details are held
for every occasion of service for the registered patients. The structure of
these OLTP databases is relatively stable and data continues to accumulate
on a day-to-day basis.
Development of custom ETL scripts, using commercial data management
tools, is recommended as there is a reasonable rate-of-return on resources
expended in development due to the life expectancy of such heavily used
applications.
Prior to the execution of the ETL scripts it is necessary for the clinicians to
review the data and ‘clean it up’. Specifically, clinicians must identify
missing data elements, correct data values that are clearly invalid and ensure
that consistent units of measurement are used for the domain of each
attribute. This aspect of the ETL framework is the most time consuming, as
indicated by earlier, acknowledged research (Han et al., 1997; Inmon, 2002).
Improved day-to-day procedures are also required to ensure that new patient
data is entered into the OLTP systems correctly. The importance of sound
data recording must be emphasised to the clinicians.
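The three review tasks named above (identifying missing data elements,
flagging clearly invalid values and enforcing consistent units of
measurement) could be supported by simple pre-load checks that produce a
worklist for the clinicians. The following is a minimal sketch only; the
field names, the plausible range and the weight conversion are assumptions
made for illustration and are not drawn from the FMMU systems.

```python
# Illustrative pre-ETL checks; field names and limits are hypothetical.
REQUIRED_FIELDS = ("patient_id", "gestation_weeks", "maternal_weight")
VALID_RANGES = {"gestation_weeks": (4, 45)}  # assumed plausible bounds


def to_kilograms(value, unit):
    """Convert a weight to kilograms so one unit is used per attribute."""
    if unit == "kg":
        return value
    if unit == "lb":
        return round(value * 0.45359237, 2)
    raise ValueError(f"unknown weight unit: {unit}")


def review_record(record):
    """Return the issues a clinician should resolve before loading."""
    issues = []
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            issues.append(f"missing: {field}")
    for field, (low, high) in VALID_RANGES.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            issues.append(f"invalid: {field}={value}")
    return issues


record = {"patient_id": 101, "gestation_weeks": 120, "maternal_weight": None}
problems = review_record(record)
weight_kg = to_kilograms(150, "lb")
```

Such checks only locate the problems; deciding the correct value still
requires the domain knowledge of the clinicians.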
A detailed project management plan is recommended for use by clinicians
involved in this data preparation phase. This is recommended to assist the
clinicians because they are required to use many different means by which to
validate or collect missing data.
For example:
o midwives/clinicians may need to refer to the NSW Midwives
database (or equivalent) to find information regarding
pregnancy outcomes
o it may be necessary to telephone the patients to determine / find
other missing aspects of data concerning their treatment by the
fetal-maternal unit
o other hospital systems may need to be queried to recover
patient demographic details or prior medical history
These data cleanup and pre-processing tasks are most likely to be undertaken
by clinicians who are also engaged in day-to-day treatment of patients,
therefore careful planning and monitoring of DW related tasks is essential. It
is not sufficient for administrative staff or computing staff to take on the data
pre-processing activities as detailed domain knowledge is required. Very
close collaboration between clinicians and IT/computing staff is required for
the activities undertaken in the ETL frameworks component of the IDSS.
4.3 Data Warehousing Architectures
The data warehouse architecture needed for the IDSS requires little in the
way of summarization and as far as possible atomic, original data values
should be used. Data is recorded and stored in OLTP databases at a fine
level of granularity and this needs to be preserved in the data warehouse to
assist in clinical research and practice.
The data is high dimensional and, for any given clinical practice analysis
or clinical research, a variety of on-the-fly calculations may be required.
These requirements are difficult to predict and thus using fine grain data in
the data warehouse provides greatest flexibility for future clinical analysis.
As described in the literature review of this thesis the data used in the
medical domain is complex, heterogeneous, often contains multiple-domain
hierarchies and is time-varying.
4.3.1 Core Dimensions for IDSS
This section lists some sample core attributes that should be included in a
medical IDSS. When clinicians choose to focus on particular research areas
this core set of data warehouse attributes will be extended as necessary.
Generic datatypes are provided; however, these are dependent on each
instantiation of the IDSS and may need to be varied.
Codes used to indicate datatype are:
Code    DataType
EC      Encoded data, e.g. 1 for forceps, 2 for suction
T       Text
BLOB    Binary Large Object
N       Numeric
B       Boolean
D       Date
DT      Time
X       As required by IDSS instantiation
Table 4.1: Datatype Coding
Some tailoring for the Australian context has been included, e.g. the
Medicare number plus the line number on the Medicare card is a unique social
security number held by all eligible Australians. This is equivalent to the
Social Security Number in the United States. The insurance details apply
only to some patients, as Private Health Insurance is optional in
Australian communities.
Moving from consideration of a generic medical IDSS to instantiation in an
Australian fetal-maternal domain requires consideration of the minimum
perinatal dataset in the Australian National Health Data Dictionary
(ANHDD). At the time of writing the minimum perinatal national dataset does
not include most of the terms required in a fetal-maternal IDSS data
warehouse. Application of the generic frameworks exposes particular
challenges in a fetal-maternal domain, including:
• Mothers can present on multiple occasions to the fetal-maternal unit,
therefore a maternal patient identifier must be combined with an
episode identifier to uniquely identify each patient pregnancy. Use of
surrogate keys is recommended with the natural keys in the data
warehouse of the fetal-maternal IDSS.
• Frequently mothers present to the fetal-maternal unit carrying
multiple fetuses. Each of the fetuses must be monitored during the
initial and subsequent visits.
This presents a challenge from a data management point of view because if
two or more fetuses have the same sex it is difficult to distinguish between
them at some stages during pregnancy. Each fetus is identified within the
data warehouse using a concatenated key containing elements for Maternal
Patient Id, Episode Id and Fetus Id. Clinicians make their best efforts to
distinguish between the unborn fetuses but sometimes physiological
measurements are transposed or otherwise inadvertently associated with an
incorrect fetus. Ideally these errors would be detected and corrected in the
OLTP databases prior to addition to the IDSS data warehouse.
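The concatenated key described above, together with the recommended
surrogate keys, might be modelled as follows. This is a sketch under stated
assumptions: the class names, the identifier formats and the simple
counter-based surrogate key are hypothetical and do not describe the
production databases.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class FetusKey:
    """Natural (concatenated) key: mother, pregnancy episode and fetus."""
    maternal_patient_id: str
    episode_id: int
    fetus_id: int


class SurrogateKeyRegistry:
    """Assigns one warehouse surrogate key to each distinct natural key."""

    def __init__(self):
        self._next = count(1)
        self._keys = {}

    def surrogate_for(self, natural_key):
        # Reuse the existing surrogate key if this fetus was seen before.
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._next)
        return self._keys[natural_key]


registry = SurrogateKeyRegistry()
twin_a = FetusKey("M0001", episode_id=2, fetus_id=1)
twin_b = FetusKey("M0001", episode_id=2, fetus_id=2)
key_a = registry.surrogate_for(twin_a)
key_b = registry.surrogate_for(twin_b)
key_a_again = registry.surrogate_for(FetusKey("M0001", 2, 1))
```

The key structure keeps twins of the same pregnancy distinct, but it cannot
detect measurements recorded against the wrong fetus; that correction still
depends on clinician review of the OLTP data.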
To further illustrate implementation challenges encountered when the
generic frameworks are applied in the fetal-maternal domain refer to
Appendix A. Appendix A contains a sample set of attributes to convey the
complex, time-varying, multi-dimensional nature of the fetal-maternal data.
It is estimated that only approximately 20% of the attributes commonly
found in the databases that support fetal-maternal units are listed in
Appendix A.
4.3.2 Domain Ontology
The construction and use of ontologies leads to tighter definitions of agreed
semantics which is essential when using domain knowledge for Decision
Support and Data Mining purposes. The literature review suggests a number
of approaches to creation of domain ontologies. The generic OLTP
databases should have technical documentation including the
entity-relationship (ER) diagrams that describe entities, attributes,
primary keys and relationships utilized with the DBMS. The ER diagrams may
be conceptual or logical – either can be put to good use in ontology
development. As Hayes et al. (2005) highlight, Concept Map construction is
a proven method for explicating and communicating domain knowledge.
The proposed generic IDSS framework includes a domain ontology to assist
in comprehending complex data. As can be seen in the previous section and
Appendix A, the data supporting the mother and fetus is complex,
heterogeneous and time varying. The approach recommended for
fetal-maternal IDSS ontology development uses a synthesis of two approaches:
(1) use of concept maps and (2) use of ER diagrams for generation of the
fetal-maternal ontology.
4.4 Data Mining / Knowledge Discovery in Data Frameworks
The KDD framework must include, as a minimum, data mining tools
suitable for the generation of knowledge rules. Rule-set generation is
essential for implementation of the CRISP-DM extensions for clinical
research as described in thesis sections 3.4 and 3.5.
The KDD framework should ideally also have a user friendly software
solution to assist clinicians to conduct exploratory data mining across patient
data as proposed and discussed in thesis section 3.4.
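To make the notion of rule generation concrete, the sketch below scores a
single candidate rule by support and confidence, two measures commonly used
in association rule mining. It is offered as an illustration only: this
section does not prescribe a particular algorithm or measure, the function
name is hypothetical and the records are entirely synthetic.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    Each record is a set of "attribute=value" strings; antecedent and
    consequent are sets of such strings.
    """
    total = len(records)
    with_antecedent = [r for r in records if antecedent <= r]
    with_both = [r for r in with_antecedent if consequent <= r]
    support = len(with_both) / total if total else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Entirely synthetic records, invented for illustration.
records = [
    {"factor=A", "outcome=X"},
    {"factor=A", "outcome=X"},
    {"factor=A", "outcome=Y"},
    {"factor=B", "outcome=Y"},
]
support, confidence = rule_metrics(records, {"factor=A"}, {"outcome=X"})
```

Here the rule "factor=A implies outcome=X" holds in two of the four
records (support) and in two of the three records that contain factor=A
(confidence).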
Application of the generic KDD framework, as described above, to the
fetal-maternal domain requires that the outputs from the KDD framework take
the form of clinical evidence reports. The generation of such clinical
evidence is of optimum value if a solid scientific discovery process
underlies the reporting and associated fetal-maternal KDD framework. This
research recognises the importance of (1) KDD frameworks and (2) scientific
process as applicable in a generic manner and specifically within the
fetal-maternal domain.
4.5 Knowledge Base Architectures
The proposed framework for the IDSS recommends rule representation of
knowledge as this format is particularly applicable where it is necessary to
recommend a course of action based on observable events – such as a patient
presenting with particular symptoms and clinicians requiring a treatment
protocol.
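A rule representation of this kind pairs a condition over observable events
with a recommended course of action. The sketch below shows one possible
encoding; it is an assumption for illustration, not the knowledge base
design of any particular product, and the rule contents are placeholders
rather than clinical guidance.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """An IF-THEN rule: a condition over observations and a recommendation."""
    name: str
    condition: Callable[[Dict[str, object]], bool]
    recommendation: str


def recommend(rules: List[Rule], observations: Dict[str, object]) -> List[str]:
    """Return the recommendation of every rule whose condition holds."""
    return [r.recommendation for r in rules if r.condition(observations)]


# Placeholder rule content only; not clinical guidance.
rules = [
    Rule("R1",
         lambda o: o.get("finding_a") is True and o.get("weeks", 0) >= 12,
         "refer for protocol P1"),
    Rule("R2",
         lambda o: o.get("finding_b") is True,
         "refer for protocol P2"),
]
advice = recommend(rules, {"finding_a": True, "weeks": 13})
```

Keeping each rule as a named, separate object also supports the clinician
review step, since individual rules can be inspected, accepted or withdrawn.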
This also melds smoothly with the proposed extended CRISP-DM
methodology, detailed in Chapter 3, which recommends rule-set generation
in the extension layers resulting from data mining activities. The adoption of
the extended CRISP-DM methodology and resultant rule representation also
helps address the open research issue raised in Literature Review section
2.5.1 by University of New South Wales Health Informatics researchers
requiring a documentable, auditable methodology for establishment of
knowledge bases for use with inference engines.
4.6 Model Base Architectures
The model base architectures must contain analytical models specifically
developed for each problem domain. It is insufficient to only have generic
strategic, tactical, operational and analytical models created for the broad
business market.
As an example, fetal-maternal models must be developed in close
cooperation with fetal-maternal experts/clinicians. These models must be
based on sound evidence-based medicine practices. Specifying the particular
models to be developed is beyond the scope of this thesis; however, models
of interest would include those:
o to predict maternal and/or fetal response to a particular
treatment protocol, e.g. intrauterine fetal transfusion of blood or
platelets
o to predict likely pregnancy outcome following chorionic villus
sampling or other invasive procedure at variable weeks
gestation
o to anticipate the likelihood and type of chromosomal abnormality
given maternal, paternal and fetal factors.
CHAPTER 5 ‘DATABABES’ CASE STUDY
5.1 Introduction
The purpose of this thesis chapter is to demonstrate the framework
introduced in Chapter 4 via a real world fetal-maternal case study. The goal
is to demonstrate the IDSS framework for clinical practice and research
including a data management component to exploit the extended CRISP-DM.
Fetal-maternal Medicine Units worldwide provide care and treatment to
mothers and their unborn children. The Fetal-maternal Medicine
Unit (FMMU) at Liverpool Hospital, Sydney is part of the Sydney South
West Area Health Service (SSWAHS) (Sydney South West Area Health
Service, 2006). SSWAHS is a New South Wales Government funded health
service for 1.3 million people. The Liverpool Hospital has been operating
continuously since the end of the eighteenth century. Patient services offered
by the Liverpool Hospital FMMU include:
First Trimester
o Ultrasound
o First trimester ultrasound
o First trimester screening – Nuchal Translucency(60%) with
PAPP A /free Beta HCG(90%)
o First trimester multiple pregnancy chorionicity assessment
o Karyotyping Procedure
o Chorionic villus sampling
Second Trimester
o Ultrasound
o Transvaginal cervical assessment
o Monochorionic twin monitoring for twin to twin transfusion
o Maternal uterine artery Doppler
o Maternal antibody ultrasounds and invasive monitoring of fetus
o Invasive Procedures
o Fetal blood sampling
o Amniocentesis
o Chorionic Villus sampling
Third Trimester
o Growth ultrasound / placental localisation
o Amniotic fluid index
o Biophysical profile
o Fetal Therapy
o Intrauterine fetal transfusion of blood or platelets
o Pigtail catheter insertion and drainage of fetal fluid
o Amniodrainage
Vast volumes of electronic data have been generated during the treatment of
patients at the Liverpool Hospital FMMU. 24,000 patient records exist in
two separate online transaction processing (OLTP) systems. These patient
records include a large number of physiological measurements and clinical
data for both the pregnant women and fetus(es) throughout the pregnancy.
Initially the FMMU utilized a DOS based Fetal Database and then moved to
the General Electric Windows based ViewPoint™ Fetal Database. These
applications are designed to assist in the day-to-day operation of
fetal-maternal and similar units. Such transaction processing information
systems
are optimised for data entry rather than data analysis operations as described
by Inmon (2002) and other researchers (Devlin, 1997; Kimball, 1996;
Marakas, 2002b).
Fetal-maternal Clinicians at the Liverpool FMMU entered into discussions
with me to investigate an improved approach to using the existing patient
data for clinical research purposes. The disparate datasets were well suited to
data warehousing and subsequent exploitation via a fetal-maternal IDSS.
5.2 Aim
The case study aims to test whether data mining and supporting technology
components can provide Fetal-maternal Medicine (FMM) Clinicians with an
improved environment for patient data analysis. Clinicians within the
FMMU held hypotheses regarding the correlations between multiple factors
within the OLTP system data, for example the relationship between
pregnancy outcome and the gauge and type of needle used for
transabdominal Chorionic Villus Sampling (CVS). This test would be
considered successful if the outcomes were considered to improve the
effectiveness of the clinicians’ research analysis.
An additional aim was generation of domain knowledge using rule
generating algorithms.
5.3 Methods
Fetal-maternal Clinicians did not have the knowledge or experience to
effectively consolidate the existing disparate patient data into a cohesive,
well defined, clean data set ready for clinical analysis and research purposes.
Numerous piecemeal attempts had been made using spreadsheets and
manual paper-based methods. There was a clear need for custom data
cleaning processes, a fetal-maternal data warehouse, standard statistical
component, KDD/DM component and importantly a reporting component.
During the Build phase of the constructivist research method it became clear
that due to poor quality data, described further below, it would be necessary
to re-define the requirements and limit the scope to Chorionic Villus
Sampling (CVS) data only. This gave rise to a CVS data mart rather than a
fully implemented data warehouse covering the scope of all maternal and
fetal data.
The DataBabes architecture includes the fundamentals of Knowledge
Discovery in Data (KDD) architectural components which are required to
facilitate CRISP-DM phases. Figure 5.1 below, illustrates disparate data
sources, custom extraction, transformation and loading components, a data
mart and proprietary data mining software tool, generating reports presenting
various aspects of clinical evidence and discovered knowledge. Figure 5.1
also illustrates the use of a Chorionic Villus Sampling Domain Ontology.
This ontology is populated by the data definitions and relationships suitable
for use with the Chorionic Villus Sampling data mart.
[Figure 5.1: DataBabes architecture. Data sources (ViewPoint Database and
DOS legacy Fetal Database) feed the Extraction, Transformation and Loading
components (Extract, Transform, Load (ETL) Process), which populate the
Data Warehouse components (Chorionic Villus Sampling Data Mart), which in
turn supply the Data Mining / Knowledge Discovery in Data (KDD)
components.]
The fundamental KDD architecture components as described by Han and
Kamber (2001), Mallach (2000) and Inmon (2002) have been customised for
the DataBabes research project. The FMMU environment includes disparate
legacy and current production OLTP systems, including an archived DOS
based mother/fetus database containing clinical and pathological markers. A
similar, but not identical, dataset is found in the current production OLTP,
relational database, GE ViewPoint™.
A star schema, as described by Inmon (2002), has been used for CVS data
mart construction. CVS procedures populate the fact table, while patient
details and procedure details form the dimension tables of the star schema,
as illustrated in Figure 5.2.
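The general shape of such a star schema can be sketched with an in-memory
SQLite database. The table and column names below are assumptions chosen for
illustration (needle gauge and needle type are taken from the research
question in section 5.2); they are not the actual DataBabes data mart
definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_patient (
        patient_key INTEGER PRIMARY KEY,   -- surrogate key
        maternal_patient_id TEXT,          -- natural key from the OLTP source
        episode_id INTEGER
    );
    CREATE TABLE dim_procedure_detail (
        procedure_key INTEGER PRIMARY KEY,
        needle_gauge INTEGER,
        needle_type TEXT
    );
    CREATE TABLE fact_cvs_procedure (
        cvs_key INTEGER PRIMARY KEY,
        patient_key INTEGER REFERENCES dim_patient(patient_key),
        procedure_key INTEGER REFERENCES dim_procedure_detail(procedure_key),
        gestation_weeks INTEGER
    );
""")
# Synthetic rows, invented for illustration.
conn.execute("INSERT INTO dim_patient VALUES (1, 'M0001', 1)")
conn.execute("INSERT INTO dim_procedure_detail VALUES (1, 20, 'type-1')")
conn.execute("INSERT INTO fact_cvs_procedure VALUES (1, 1, 1, 12)")

# A typical star-schema query joins the fact table to its dimensions.
rows = conn.execute("""
    SELECT p.maternal_patient_id, d.needle_gauge, f.gestation_weeks
    FROM fact_cvs_procedure AS f
    JOIN dim_patient AS p ON p.patient_key = f.patient_key
    JOIN dim_procedure_detail AS d ON d.procedure_key = f.procedure_key
""").fetchall()
conn.close()
```

Each fact row records one CVS procedure and points to one row in each
dimension, so analyses by patient or by procedure detail reduce to joins
on the surrogate keys.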
[Figure 5.2: Star schema for the Chorionic Villus Sampling data mart]
5.4 Results
Results of the case study can be viewed in two ways: (1) quantitative results
from data mining activities and (2) qualitative outcomes from undertaking
the data warehousing project. Results for (1) are disappointing, with many
missing data values and incomplete patient records hindering the creation of
worthwhile data mining outcomes. However, this should not be seen as a
wasted opportunity because, through the conduct of the CVS analysis and
data mining, the fetal-maternal clinicians became far more aware of the
importance of accurate data recording for strategic research purposes. Local
data collection procedures were improved and data accuracy was raised at
fetal-maternal unit staff meetings. In addition, a part time research
clinician position was established to focus on improving the existing
datasets by recording missing data values by referring to data sources
beyond the confines of the fetal-maternal unit, such as the NSW midwives
database.
Thus, the qualitative outcomes (2) were encouraging and a further
opportunity to conduct CVS data mining will arise following data
improvement.
5.5 Conclusions
The lead researchers at the Pacific Asia Knowledge Discovery in Data
(2004) conference were correct to draw new researchers’ attention to the
challenge of ‘real world’ data sets. This case study exposed many of the
issues the leading researchers highlighted – particularly the poor quality of
real world data and the complex nature of medical domain knowledge.
However, given the positive response from the Director and Clinicians at the
Liverpool fetal-maternal unit the resulting improvements in data quality
inspire future research.
The paradigm shift towards improving data capture during clinical sessions
offers the potential to significantly improve the quality of the clinical
research conducted in the fetal-maternal unit. However, this case study
demonstrates there are currently several factors impacting its mainstream
adoption including data accuracy issues and senior hospital management
reluctance to embrace exploratory data mining across patient data sets.
Attempts to gain follow-on ethics approvals – after the granting of the initial
approval – were unsuccessful. This was largely due to the concerns some
senior hospital management had regarding the potential exposure of
unexpected relationships across the patient data attributes. It should be noted
that many clinicians and senior management were very interested in this case
study and gave their full support to the ongoing research into IDSS in the
medical domain. This attitude was not shared by all.
CHAPTER 6 CONCLUSION
6.1 Contribution to Knowledge
This work began with identification of open research areas including:
1. Existing investigative methods used when data mining across patient
medical data are inadequate for the demands of clinical practice and
research. The null hypothesis driven medical research paradigm must inform
data mining investigative methods in the medical domain.
2. In the medical domain improvement is required in the elicitation of
domain knowledge for use within knowledge bases in IDSS.
3. The exploitation of IDSS in the medical domain, particularly in the
Australian context, has been slow and clinicians have concerns regarding the
content of knowledge bases found in IDSS.
4. Consideration of the feasibility of extending the CRISP-DM to enable its
use in medical research driven by the null hypothesis paradigm.
This thesis addressed these open research issues by making the following
contributions to knowledge:
1. Existing investigative methods used when data mining across patient
medical data are inadequate for the demands of clinical practice and
research. The null hypothesis driven medical research paradigm must
inform data mining investigative methods in the medical domain.
This research defined extensions to the CRISP-DM to facilitate its use in
clinical practice and medical research applications. In addition, this research
exposed the parallelism between CRISP-DM and the Scientific Method and
the importance of the role played by both exploratory and confirmatory data
mining.
2. In the medical domain improvement is required in the elicitation of
domain knowledge for use within knowledge bases in IDSS.
My extended CRISP-DM offers a largely automated approach to data mining
that generates electronic rule-sets based on clinical evidence captured and
stored in electronic OLTP systems. This rule representation of knowledge is
then suitable for use in knowledge bases in IDSS.
3. The exploitation of IDSS in the medical domain, particularly in the
Australian context, has been slow and clinicians have concerns regarding
the content of knowledge bases found in IDSS.
The framework for IDSS proposed in this research has been developed in
collaboration with an Australian fetal-maternal medicine unit. The evidence
based rule-sets are thus generated from the local datasets. The extended
CRISP-DM offers an approach using an investigative method akin to the
null hypothesis driven scientific method thus providing confidence regarding
rigour and statistical significance.
4. Consideration of the feasibility of extending the CRISP-DM to
enable its use in medical research driven by the null hypothesis paradigm.
The extended CRISP-DM presented in this research demonstrates the
manner in which this generic approach can be extended to enable its use in
important medical research driven by the null hypothesis paradigm.
6.2 Future Research
Specific future research in the fetal-maternal domain includes the
development of a domain ontology. The details of sample attributes
described in the Chapter 5 case study represent approximately 15% of the
factors in this domain. The volume and complexity of factor relationships is
vast yet the benefits of a fetal-maternal ontology would be immediately
appreciated in the IDSS context. Similarly the capture of domain knowledge
in a machine readable format would also benefit the fetal-maternal
community through provision of a knowledge base for IDSS purposes. The
development of methodologies and tools to capture this domain knowledge
has scope well beyond the fetal-maternal area into the broader health
environment.
Future research regarding the extended CRISP-DM could investigate the
number of significant rule sets generated versus spurious or meaningless rule
sets for a given medical dataset. Developing metrics to measure the validity
or impact of unexpected knowledge exposed during the exploratory data
mining phase of the extended CRISP-DM is also an interesting area for
future research.
Finally, the development of an application to support the clinicians as they
work through the additional, medical specific layers in CRISP-DM when
used for clinical research would be a valuable, marketable future direction
for research. This would be a change from most of the field’s earlier work
which directed efforts to building tools for data mining ‘experts’ rather than
domain knowledgeable end-users.
6.3 Conclusion
A conclusion that can be drawn from this research is that there are strong
parallels between the widely accepted CRISP-DM and the long established
scientific method, as illustrated in the diagram in Figure 3.3. Exploratory
data mining and confirmatory data mining play an important part in the
extended CRISP-DM methodology.
The additional layers proposed within the CRISP-DM for medical research
have been shown to accommodate the demands of the null-hypothesis
paradigm. The feasibility of extending CRISP-DM to enable its use in
medical research driven by the null hypothesis paradigm has been
demonstrated. The layers (Data Mining Rule-Set Generation, Selecting
Significant Rule-Sets, Formulating the Null Hypothesis, Running Statistical
Processes to Test the Null Hypothesis and, finally, Loading Accepted
Rule-Sets into the Associated IDSS) have been defined within this thesis.
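The later layers can be illustrated for a single mined rule: a null
hypothesis of independence is formulated, a statistical test is run, and the
rule is loaded only if the null hypothesis is rejected. The sketch below
makes several assumptions that are not specified in this section: it uses a
Pearson chi-square test on a 2x2 table at a significance level of 0.05, the
function names are hypothetical, the counts are synthetic, and the clinician
review that precedes loading is omitted.

```python
# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRITICAL_DF1_ALPHA_05 = 3.841


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def process_candidate_rule(rule, table, knowledge_base):
    """Test H0 (antecedent and outcome are independent) for one mined rule
    and load the rule into the knowledge base only if H0 is rejected."""
    statistic = chi_square_2x2(*table)
    rejected = statistic > CHI2_CRITICAL_DF1_ALPHA_05
    if rejected:
        knowledge_base.append(rule)
    return statistic, rejected


knowledge_base = []
# Synthetic counts: rows = antecedent present/absent, columns = outcome yes/no.
stat_1, rejected_1 = process_candidate_rule(
    "rule-1", (30, 10, 10, 30), knowledge_base)
stat_2, rejected_2 = process_candidate_rule(
    "rule-2", (20, 20, 21, 19), knowledge_base)
```

With these synthetic counts only the first rule reaches the knowledge base;
the second shows almost no association and its null hypothesis is retained.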
The establishment of the extended CRISP-DM, utilizing additional proposed
layers, enables this to be applied to medical research in accordance with
Research Hypothesis 1.
The framework for IDSS to support the extended data mining methodology
has been defined, as described below. The importance of the domain
ontology and knowledge base has been presented, specifically for the
fetal-maternal domain. The extended CRISP-DM methodology operates within
the highlighted Data Management component. The rule-sets generated by the
data management component ‘feed’ into the rule based knowledge
representation.
The importance of the integration between the extended CRISP-DM
methodology and the supply of domain knowledge to meet the requirements
of the proposed IDSS framework has been presented. The resultant IDSS
with a data management component capable of exploiting the extended
CRISP-DM methodology, as per Research Hypothesis 1, addresses the
demands of Research Hypothesis 2.
Research Hypothesis 1: The Cross Industry Standard Process – Data
Mining(CRISP-DM) can be extended to enable its use in medical
research driven by the null hypothesis paradigm
and
Research Hypothesis 2: An Intelligent Decision Support System
(IDSS) can be defined for clinical practice and research including a
data management component to exploit the extended CRISP-DM
methodology.
have both been addressed and shown to provide an integrated research
outcome, specifically an extended CRISP-DM for use with null hypothesis
driven research and a supporting IDSS framework for clinical practice and
research. Therefore, these hypotheses have been proven.
REFERENCES
Alavi, M., & Carlson, P. (1992). A review of MIS Research and Disciplinary
Development. Journal of Management Information Systems, 8(4), 45-62.
Alexandrini, F., Krechel, D., Maximini, K., & von Wangenheim, A. (2003).
Integrating CBR into the health care organization. Paper presented at the 16th
IEEE Symposium on Computer-Based Medical Systems, New York, New
York, USA.
Amardeilh, F., Laublet, P., & Minel, J. (2005). Documentation Annotation and
Ontology Population from Linguistic Extractions. Paper presented at the
K-CAP '05, Banff, Alberta, Canada.
Anderson, G. Lecture 1: Scientific Method. Retrieved February 8, 2006, from
http://pasadena.wr.usgs.gov/office/ganderson/es10/lectures/lecture01/lecture01
.html
Anthony, R.N. (1965). Planning and Control Systems: A Framework for Analysis.
Cambridge, MA., Harvard University Graduate School of Business
Management.
Australian Health Information Council. (2003). Electronic Decision Support for
Australia's Health Sector . Retrieved 11 May 2006, from www.ahic.org.au
Australian Health Information Council. (2003) Electronic Decision Support
Evaluation Methodology. Retrieved 11 May 2006, from
http://www.ahic.org.au/evaluation/guidelines.htm
Avison, D. (2002). Action Research: A Research Approach for Cooperative Work.
Paper presented at the 7th International Conference on Computer Supported
Cooperative Work in Design, Rio de Janeiro, Brazil.
Avison, D., Lau, F., & Myers, MD. (1999). Action Research. Communications of the
ACM, 42(1), 94-97.
Babcock, B., Babu, S., Datar, M., Motwani, R. & Widom, J. (2002). Models and
Issues in Data Stream Systems. Paper presented at the 21st ACM
SIGMOD-SIGART Symposium on Principles of Database Systems, Madison,
Wisconsin.
Beckett, D. (2004). Scalable RDBMS report. Retrieved 4 June 2004, from
www.w3.org/2001/sw/Europe/reports/scalable_rdbms_mapping_report
Becquet, C., Blachon, S., Jeudy, B., Boulicaut, JF., Gandrillon, O. (2002).
Strong-association-rule mining for large-scale gene-expression data
analysis: a case
study on human SAGE data. Genome Biology, 3(12).
Bench-Capon, T., Coenen, F., Nwana, H., Paton, R., Shave, M. (1993). Two aspects
of the validation and verification of knowledge based systems. IEEE Expert,
8(3), 76-81.
Bench-Capon, T., & Visser, P. (1997). Ontologies in Legal Information Systems; The
Need for Explicit Specifications of Domain Conceptualisations. Paper
presented at the 6th International Conference on AI and Law, Melbourne,
Victoria, Australia.
Berndt, DJ., Fisher, JW., Hevner, AR., & Studnicki, J. (2001). Healthcare Data
Warehousing and Quality Assurance. Computer, December 2001, 56-65.
Blackmore, K., & Bossomaier, T.R.J. (2002). Soft computing methodologies for
mining missing person data. In Proceedings of Sixth Australia-Japan Joint
Workshop on Intelligent and Evolutionary Systems (AJJWIES 2002),
Canberra, ACT, Australia.
Bonifati, A., Cattaneo, F., Ceri, S., Fuggetta, A. & Paraboschi, S. (2001). Designing
Data Marts for Data Warehouses. ACM Transactions on Software Engineering
and Methodology, 10(4), 452-483.
Boose, J. (1985). A Knowledge Acquisition Program for Expert Systems Based on
Personal Construct Psychology. International Journal of Man Machine
Studies, 23, 495-525.
Boucelma, O., Castano, S., & Goble, C. (2002). Report on the EDBT'02 Panel on
Scientific Data Integration. SIGMOD Record, 31(4).
Brossette, S., Sprague, A., Hardin, M., Waites, K., Jones, W., & Moser, S. (1998).
Association rules and data mining in hospital infection control and public
health domain. Journal of American Medical Informatics Association, 5(4),
373-381.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J. & Zanasi, A. (1997). Discovering
Data Mining from Concept to Implementation. New Jersey, USA:
Prentice-Hall PTR.
Carey, S. (1994). A Guide to the Scientific Method. California, USA: Wadsworth
Publishing Company.
Cesnik, B. (2002). Report of the Electronic Decision Support Governance Workshop.
Retrieved 11 May 2006, from www.ahic.org.au
Connolly, T., & Begg, C. (2005). Database Systems: A Practical Approach to Design,
Implementation and Management (4th ed.). England: Addison-Wesley.
Corsi, P., Weindling, P. (Eds.) (1983) Information Sources in the History of Science
and Medicine. London: Butterworths.
CRISP-DM. (2004). Retrieved 1 December 2004, from www.crisp-dm.org
Devlin, B. (1997). Data Warehouse- from Architecture to Implementation. Reading
Mass: Addison Wesley.
DxPlain (2006) Lab of Computer Science, Massachusetts General Hospital.
Retrieved 3 February 2006, from
http://www.lcs.mgh.harvard.edu/projects/dxplain.html
ETL Portal. (2006). DM Review Retrieved 1 November 2006, from
http://www.dmreview.com/portals/portal.cfm?topicId=230206
Ewen, E., Medsker, C., Dusterhoft, L., Levan-Schultz, K., Smith, J., & Gottschall, M.
(1999). Data Warehousing in an Integrated Health System; Building the
Business Case. ACM, 47-53.
Gahleitner, E., Behrendt, W., Palkoska, J., & Weippl, E. (2005). On Cooperatively
Creating Dynamic Ontologies. Paper presented at the ACM HT'05, Salzburg,
Austria.
Galliers, R.D. (1993). Research Issues in information systems. Journal of Information
Technology, 8, 92-98.
Golfarelli, M., Rizzi, S., & Vrdoljak, B. (2001). Data warehouse design from XML
sources. Paper presented at the 4th ACM International Workshop on Data
Warehousing and OLAP, Atlanta, Georgia, USA.
Gomez-Perez, A. (2004). Retrieved 4 June 2004, from
ontoweb.aifb.uni-karlsruhe.de/Members/ruben/Deliverable%201.5
Goodwin, L., & Grzymala-Busse, J. (2001). Data Mining Approaches for Perinatal
Knowledge Building. Handbook of Data Mining and Knowledge Discovery.
New York: Oxford University Press.
Goodwin, L., Iannacchione, A., Hammond, W., Crockett, P., Maher, S., & Schlitz, K.
(2001). Data Mining Methods Find Demographic Predictors of Preterm Birth.
Nursing Research, 50(6), 340 - 345.
Goodwin, L., Maher, S., Ohno-Machado, L., Iannacchione, M., Crockett, P.,
Dreiseitl, S., Vinterbo, S., & Hammond, W. (2000). Building Knowledge in a
Complex Preterm Birth Problem Domain. Paper presented at the AMIA
Annual Fall Symposium, Philadelphia.
Gorry, G.A, & Scott Morton, M. (1971). A Framework for Management Information
Systems. Sloan Management Review, 13, 55-70.
Graziano, A., & Raulin, M. (2003). Research Methods a Process of Inquiry (5th ed.).
Boston: Pearson Education.
Gross-Portney, L., & Watkins, M. (2000). Foundations of Clinical Research:
Applications to Practice (7th ed.). New Jersey: Prentice Hall Health.
Gruber, T.R. (1992). ONTOLINGUA: A Mechanism to Support Portable Ontologies.
Technical report, Knowledge Systems Laboratory, Stanford University,
California, USA.
Hagland, M. (2004). Health Care Informatics Online Data Mining. Health Care
Informatics. Retrieved 1 April 2004 from http://www.healthcareinformatics.com/
Han, J. (1995). Mining Knowledge at Multiple Concept Levels. Paper presented at the
4th International Conference on Information and Knowledge Management.
Baltimore, Maryland, USA.
Han, J. (1996). Data mining techniques. ACM SIGMOD Record, Proceedings of 1996
ACM SIGMOD international conference on management of data SIGMOD'96,
25(2), Montreal, Quebec, Canada.
Han, J. (1998). Towards on-line analytical mining in large databases. ACM SIGMOD
Record, 27(1).
Han, J. (2002). Evolving data mining into solutions for insights: Emerging scientific
applications in data mining. Communications of the ACM, 45(8), 54-58.
Han, J., Chiang, J., Chee, S., Chen, J., Chen, Q., Cheng, S., Gong, W., Kamber, M.,
Koperski, K., Liu, G., Lu, Y., Stefanovic, N., Winstone, L., Xia, B., Zaiane, O.,
Zhang, S., & Zhu, H. (1997). DBMiner: a system for data mining in relational
databases and data warehouses. Paper presented at the 1997 Conference of
the Centre for Advanced Studies on Collaborative research. Toronto, Canada.
Han, J., & Kamber, M. (2001). Data Mining Concepts and Techniques (1st ed.). San
Francisco: Morgan Kaufmann Publishers.
Han, J., & Pei, J. (2000). Mining frequent patterns by pattern-growth: methodology
and implications. ACM SIGKDD Explorations Newsletter, 2(2), 14-20.
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate
generation. ACM SIGMOD Record, SIGMOD'00, 29(2), 1-12.
Hayes, P., Reichherzer, T., & Mehrotra, M. (2005). Collaborative Knowledge Capture
in Ontologies. Paper presented at the K-CAP '05, Banff, Alberta, Canada.
Heath, J., Heath, S., McGregor, C., & Smoleniec, J. (2004). DataBabes: A Case Study in
Data Warehousing and Mining Perinatal Data. Paper presented at
CASEMIX, Sydney, Australia.
Heath, J., & McGregor, C. (2004). Research Issues in Intelligent Decision Support.
Paper presented at the UWS College of Science, Technology and
Environment, Innovation Conference, Sydney, Australia.
Heath, J., McGregor, C., & Smoleniec, J. (2005). DataBabes: A Case Study in
Feto-Maternal Clinical Data Mining. Paper presented at the Health Informatics
Conference of Australia, Melbourne.
Hummer, W., Bauer, A., & Harde, G. (2003). XCube - XML For Data Warehouses.
Paper presented at the 6th ACM International Workshop on Data Warehousing
and OLAP, New Orleans, Louisiana, USA.
Inmon, W. (2002). Building the Data Warehouse (3rd ed.). New York: Wiley.
Ji, W., Naguib, R.N.G., & Ghoneim, M.A. (2003). Neural network-based assessment
of prognostic markers and outcome prediction in bilharziasis-associated
bladder cancer. IEEE Transactions on Information Technology in Biomedicine,
7(3), 218-224.
Johnson, S.B. (2004). The development of decision support systems to enable plant
demographic research in the Australian cotton industry. Australia:
Department of Primary Industries.
Jung, G., & Gudivada, V. (1995). Automatic determination and visualisation of
relationships among symptoms for building medical knowledge bases. Paper
presented at the 1995 ACM Symposium on Applied Computing, Nashville
Tennessee, USA.
Kawamoto, K., Houlihan, C., Balas, E., & Lobach, D. (2005). Improving clinical
practice using clinical decision support systems: a systematic review of trials
to identify features critical to success. British Medical Journal, 330(7494).
Kennedy, P. (2004). Extracting and Explaining Biological Knowledge in Microarray
Data. Paper presented at the Pacific Asia Knowledge Discovery in Data
(PAKDD) 2004, Sydney, Australia.
Kimball, R. (1996). The Data Warehouse Toolkit. New York: Wiley.
Kock, N.F., Avison, D., Baskerville, R., Myers, M., & Wood-Harper, T. (1999). IS
Action Research: Can We Serve Two Masters? Paper presented at the 20th
International Conference on Information Systems, Charlotte, North Carolina,
USA.
Kock, N.F., McQueen, R.J., & Baker, M. (1996). Negotiation In Information Systems
Action Research. Paper presented at the Information Systems Conference of
New Zealand, Palmerston North, New Zealand.
Kovalerchuk, B., Vityaev, E., & Ruiz, J.F. (2000). Consistent knowledge discovery in
medical diagnosis. IEEE Engineering in Medicine and Biology Magazine,
19(4), 26-37.
Lee, S., & Abbott, P. (2003). Bayesian networks for knowledge discovery in large
datasets: basics for nurse researchers. Journal of Biomedical Informatics, 36, 389-399.
Little, J.D. (1970). Models and Managers: The Concept of a Decision Calculus.
Management Science, 16(8), 466-485.
Lord, S., Gebski, V., & Keech, C. (2004). Multiple analyses in clinical trials: sound
science or data dredging? Medical Journal of Australia, 181(8), 452-454.
Lyman, J., Boyd, J., & Dalton, J. (2003). Applying the HL7 reference information
model to a clinical data warehouse. Paper presented at the IEEE International
Conference on Systems, Man and Cybernetics, 2003, Washington DC, USA.
Mackinnon, J., & Glick, N. (1999). Data Mining and Knowledge Discovery in
Databases - An Overview. Australian and New Zealand Journal of Statistics,
41(3), 255-275.
Mallach, E. (2000). Decision Support and Data Warehouse Systems. New York: Irwin
McGraw-Hill.
Marakas, G.M. (2002a). Decision Support Systems in the 21st Century. Upper Saddle
River, New Jersey: Prentice Hall.
Marakas, G.M. (2002b). Modern Data Warehousing, Mining and Visualisation.
Upper Saddle River, New Jersey: Prentice Hall.
Masuda, G., Sakamoto, N., & Yamamoto, R. (2002). A Framework for Dynamic
Evidence Based Medicine using Data Mining. Paper presented at the 15th
IEEE Symposium on Computer-Based Medical Systems, Maribor, Slovenia.
Masuda, G., & Sakamoto, N. (2002). A framework for dynamic evidence based
medicine using data mining. Paper presented at the 15th IEEE Symposium on
Computer-Based Medical Systems, 2002. (CBMS 2002), Maribor, Slovenia.
Mathews, J.R. (1995). Quantification and the Quest for Medical Certainty. New
Jersey: Princeton University Press.
Matsumoto, T., Ueda, Y., & Kawaji, S. (2002). A software system for giving clues of
medical diagnosis to clinician. Paper presented at the 15th IEEE Symposium on
Computer-Based Medical Systems, 2002. (CBMS 2002), Maribor, Slovenia.
McCarthy, J. (2000). Phenomenal Data Mining: From Data to Phenomena. ACM
SIGKDD Explorations Newsletter, 1(2), 24-29.
McGregor, C., Bryan, G., Curry, J., & Tracey, M. (2002). The e-Baby Data Warehouse:
A Case Study. Paper presented at the 35th Hawaii International Conference on
System Sciences, Hawaii, USA.
Miquel, M., & Tchounikine, A. (2002). Software components integration in medical
data warehouses: a proposal. Paper presented at the 15th IEEE Symposium on
Computer-Based Medical Systems, 2002. (CBMS 2002), Maribor, Slovenia.
Ohsaki, M., Sato, Y., Kitaguchi, S., Yokoi, H., & Yamaguchi, T. (2004). Comparison
between objective interestingness measures and real human interest in
medical data mining. Paper presented at the 17th International Conference on
Innovations in Applied Artificial Intelligence, Ottawa, Canada.
Ohsaki, M., Kitaguchi, S., Yokoi, H., & Yamaguchi, T. (2005). Investigation of Rule
Interestingness in Medical Data Mining. Active Mining, Springer(3430), 174-189.
Pedersen, T., & Jensen, C. (1998). Research Issues in Clinical Data Warehousing.
Paper presented at the 10th International Conference on Scientific and
Statistical Database Management, Capri, Italy.
Piantadosi, P. (1997). Clinical Trials: A Methodologic Perspective (1st ed.). New
York: John Wiley & Sons.
Podgorelec, V., Kokol, P., & Stiglic, M. (2002). Searching for new patterns in
cardiovascular data. Paper presented at the 15th IEEE Symposium on
Computer-Based Medical Systems, 2002. (CBMS 2002), Maribor, Slovenia.
Popp, R., Armour, T., Senator, T., & Numryk, K. (2004). Countering Terrorism
Through Information Technology. Communications of the ACM, 47(3), 36-43.
Povalej, P., Lenic, M., Zorman, M., Kokol, P., Peterson, M., & Lane, J. (2003).
Intelligent data analysis of human bone density. Paper presented at the 16th
IEEE Computer-Based Medical Systems, 2003, New York, New York.
Qiao, L., Agrawal, D., & Abbadi, A. (2003). Supporting Sliding Windows Queries for
Continuous Data Streams. Paper presented at the 15th International
Conference on Scientific and Statistical Database Management, Cambridge,
MA, USA.
Raghupathi, W., & Tan, J. (2002). Strategic IT Applications in Health
Care. Communications of the ACM, 45(12), 56-61.
Rao, R., Niculescu, R., Germond, C., & Rao, H. (2003). Clinical and Financial
Outcomes Analysis with Existing Hospital Patient Records. Paper presented at
the SIGKDD, Washington DC.
Rindfleisch, T. (1997). Privacy, Information Technology and Health Care.
Communications of the ACM, 40(8), 93-100.
Robinson, J.B. (2005). Understanding and Applying decision support systems in
Australian farming systems research. University of Western Sydney, Sydney.
Roddick, J., Fule, P., & Graco, W. (2003). Exploratory Medical Knowledge
Discovery: Experiences and Issues. SIGKDD Explorations Newsletter, 5(1),
94-99.
Roiger, R., & Geatz, M. (2003). Data Mining. England: Addison Wesley.
Sabou, M., Wroe, C., Goble, C., & Mishne, G. (2005). Learning Domain Ontologies
for Web Service Descriptions: an experiment in Bioinformatics. Paper
presented at the IW3C2, Chiba, Japan.
Sackett, D., Rosenberg, W., Muir Gray, J., Haynes, B., & Scott-Richardson, W.
(1996). Evidence based medicine: what it is and what it isn't. British Medical
Journal, 312, 71-72.
Schubart, J., & Einbinder, J. (2000). Evaluation of a data warehouse in an academic
health sciences center. International Journal of Medical Informatics, 60(3),
319-333.
Simon, H.A. (1960). The New Science of Management Decision. New York: Harper
and Collins.
Summons, P., Giles, W., & Gibbon, G. (1999). Decision Support for Fetal Gestation
Age Estimation. Paper presented at the 10th Australasian Conference on
Information Systems, Wellington, New Zealand.
Susman, G.I., & Evered, R.D. (1978). An Assessment of the Scientific Merits of
Action Research. Administrative Science Quarterly, 23, 582-603.
Sydney South West Area Health Service. (2006). Retrieved 27 December 2005,
from http://www.sswahs.nsw.gov.au/Service_Facility.aspx
Tsymbal, A., Cunningham, P., Pechenizkiy, M., & Puuronen, S. (2003). Search
strategies for ensemble feature selection in medical diagnostics. Paper
presented at the 16th IEEE Symposium on Computer-Based Medical Systems,
2003, New York, New York.
Turban, E., & Aronson, J. (2001). Decision Support Systems and Intelligent Systems.
Upper Saddle River, NJ: Prentice Hall.
Upadhyaya, S., & Kumar, P. (2005). ERONTO: A Tool for Extracting Ontologies
from Extended E/R Diagrams. Paper presented at the SAC'05, Santa Fe, New
Mexico, USA.
Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002). Conceptual Modeling for
ETL Processes. Proceedings of 5th ACM International Workshop on data
warehousing and OLAP, McLean, VA, USA, 14-21.
Wang, H., Fan, W., Yu, S., & Han, J. (2003). Mining concept-drifting data streams
using ensemble classifiers. Paper presented at the 9th ACM SIGKDD,
Washington DC, USA.
Warren, J., & Stanek, J. (2005). Decision Support Systems. In M. Conrick (Ed.),
Health Informatics Transforming Healthcare with Technology (pp. 252-265).
Melbourne: Thomson.
Webb, G., Han, J., & Fayyad, U. (2004). Panel Discussion. Paper presented at the 8th
Pacific Asia Knowledge Discovery in Data, Sydney, Australia.
Webb, G.I. (2001, August). Discovering associations with numeric variables.
Paper presented at the 7th ACM SIGKDD international conference on
knowledge discovery and data mining, Boston, MA, USA.
Webb, G.I., Butler, S., & Newlands, D. (2003). On detecting differences between
groups. Paper presented at the 9th ACM SIGKDD international conference on
knowledge discovery and data mining, Washington DC, USA.
Webb, G.I. (2000). Efficient search for association rules. Paper presented at the 6th
ACM SIGKDD international conference on knowledge discovery and data
mining, Boston, MA, USA.
Wong, M.L., Lam, W., Leung, K. S., Ngan, P. S., & Cheng, J.C.Y. (2000).
Discovering knowledge from medical databases using evolutionary
algorithms. IEEE Engineering in Medicine and Biology Magazine, 19(4), 45-55.
Xintao, W., & Daniel, B. (2002). Learning missing values from summary constraints.
ACM SIGKDD Explorations Newsletter, 4(1).
Xu, Z., Cao, X., Dong, Y., & Wenping, S. (2004). Formal Approach and Automated
Tool for Translating ER Schemata into OWL Ontologies. Paper presented at the
8th Pacific Asia Knowledge Discovery in Data Conference, Sydney, Australia.
Yu, C. (2004). A web-based consumer-oriented intelligent decision support system for
personalized e-service. Paper presented at the 6th International conference on
electronic commerce ICEC '04, Delft, The Netherlands.
Yu, P. (2004). Keynote Address. Paper presented at the 8th Pacific Asia Knowledge
Discovery in Data, Sydney, Australia.
Zaidi, S., Abidi, S., & Manickam, S. (2002). Distributed data mining from
heterogeneous healthcare data repositories: towards an intelligent agent-based
framework. Paper presented at the 15th IEEE Symposium on Computer-Based
Medical Systems, 2002. (CBMS 2002), Maribor, Slovenia.
Zdanowicz, J. (2004). Detecting Money Laundering and Terrorist Financing with
Data Mining. Communications of the ACM, 47(5), 53-55.
Zeleznikow, J., & Nolan, J. (2001). Using Soft Computing to build real world
intelligent decision support systems in uncertain domains. Decision Support
Systems, 31, 263-285.
Zorman, M., Kokol, P., Lenic, M., Povalej, P., Stiglic, B., & Flisar, D. (2003).
Intelligent platform for automatic medical knowledge acquisition: detection
and understanding of neural dysfunctions. Paper presented at the 16th IEEE
symposium on Computer-Based Medical Systems, 2003, New York, New
York.
Appendix A
Appendix A contains a sample set of attributes to convey the complex, time-varying,
multi-dimensional nature of the fetal-maternal data. It is estimated that only
approximately 20% of the attributes commonly found in the databases that support
fetal-maternal clinical practice are listed in Appendix A.
MaternalPatient
PatientId(X)
EpisodeID(X)
StreetNumber&Name(T)
Suburb(T)
State(T)
Country(T)
Postcode(N)
DateOfBirth(D)
Email(T)
Ethnic Group(EC)
FamilyDrawing(BLOB)
HomePhone(N)
MobilePhone(N)
WorkPhone(N)
MedicareNumber(N)
MedicareLineNum(N)
Insurance(B)
InsuranceProvidor(T)
InsurancePolicyNumber(T)
MaidenName(T)
Title(T)
Surname(T)
FirstName(T)
MiddleName(T)
Occupation(EC)
Religion(EC)
HospitalNumber(X)
Fetus
MaternalPatientID(X)
EpisodeID(X)
FetusID(X)
3Ventricle(N)
4Ventricle(N)
4Chamber(N)
Abdomen(EC)
AbdomenDescription(T)
AbnormalVenouseReturn(B)
Acardia(B)
Achondrogenesis(B)
AchondrogenesisType(T)
AD1(N)
AdditionalBiometry(B)
AF_Comment(T)
AFDeepestPool(N)
AFDeepestPool$(B)
AFIndex(N)
AFIndex$(B)
AFLeftLowerPool(N)
AD2(N)
AFLeftUpperPool(N)
AFRightLowerPool(N)
AFRightUpperPool(N)
AortaDiam(N)
AortaStenosis(B)
AortaStenosis1(T)
AorticCoarction(B)
AorticIsthmusStenosis(B)
AorticValveAtresia(B)
ArCyst1(N)
ArCyst2(N)
ArnoldChiariA(B)
ArCyst3(N)
Arachnoid(B)
ArnoldChiariB(B)
ArterialDoppler(B)
ArthrogryposisMultiplex(B)
ASD1(B)
ASD2(B)
AVSeptal(B)
Balkenaplasia(B)
BladderEntrophy(B)
BodyStalkAnomaly(B)
BPD(N)
BPDFL(N)
BPDOFD(N)
BPF1(N)
BPF2(N)
BPF3(N)
BPF4(N)
BPF5(N)
BPF6(N)
BPFAccelerations(N)
BPFAFVolume(N)
BPFBodyMovements(N)
BPFPlacentalGrading(N)
BPFRespiratoryMovements(N)
BPFScore(N)
BPFSystem(N)
BPFTone(N)
Brachycephaly(B)
Brain(N)
BrandefOther(T)
Brandeftext(T)
BrochogenicCysts(B)
BrochogenicCystsA(N)
BrochogenicCystsDiam(N)
Calcification(B)
CAM(B)
CAMSite(N)
CAMType(N)
CamptomelicDysplasia(B)
CardDiamAP(N)
CardDiamT(N)
CardiaDiamT(N)
CardiacTumour(B)
CC(N)
CCTC(N)
CDH(B)
CDHSite(T)
Chest(N)
Chest$(B)
ChestOther(T)
ChestText(T)
ChestWallA(B)
ChestWallB(B)
ChestWallC(B)
ChestWallD(B)
ChondrodystrophicDystrophy(B)
Chrom2(T)
Chrom1(T)
Chromosomes(EC)
CloverLeafShape(B)
CloacalExtrophy(B)
CM(N)
Coarctation(B)
COMCarotidEDF(N)
ComCarotidPI(N)
CpmCarotidRI(N)
ComCarotidVm(N)
ComCarotidVmax(N)
Cord(N)
CordChoriangioma(B)
CordChoriangiomaSize(N)
CordCysts(T)
CordKnot(B)
CordRoundNeck(B)
CordSingleArtery(B)
CordSiteDetails(T)
CordDescription(T)
CysticHygromas(B)
CysticHygromasA(N)
CRL(N)
CysticHygromasB(N)
Cysts(B)
DawesRedmanCriteria(N)
Degree(N)
DHContentOther(T)
DHContentA(B)
DHContentB(B)
DHContentC(B)
DHContentD(B)
Diagnosis(T)
DILDetail(N)
Dilation(B)
DilatedCM(B)
DirectPrep(N)
Dolichocephaly(B)
DoubleOutletLV(B)
DoubleOutletRV(B)
DPrepText(T)
DRMinutes(N)
DysDetails(T)
Dysrhythmia(B)
EarlyPregBiom(B)
Ebstein(B)
EchocardiographyDesc(T)
EctopiaCordis(B)
EFWMethod(N)
EllisVanCrefeld(B)
EmbryoStructure(N)
EncephDetails(T)
Encephalocele(B)
EPAbdomen(T)
EPBiometryDesc(T)
EPBladder(T)
EPBrain(T)
EPFeet(T)
EPHands(T)
EPMalformation(B)
EPMalformationDesc(T)
EPOther(T)
EPSkull(T)
EPSpine(T)
EPStomach(T)
EstWeight(N)
ESTWeightLbs(N)
EstWeightOz(N)
Exencephaly(B)
Exomphalos(B)
ExomphalosBladder(B)
ExomphalosBowel(B)
ExomphalosHeart(B)
ExomphalosLiver(B)
ExomphalosMeas(B)
ExomphalosMesentery(B)
ExomphalosStomach(B)
FemurR(N)
FetalHeartActivity(T)
FetalMovements(EC)
FibulaL(N)
FibulaR(N)
FirstTrimesterRisk(B)
FootLeft(N)
FootRight(N)
Fallot(B)
Fetal Doppler Measurements
MaternalPatientID(X)
EpisodeID(X)
FetusID(X)
FetalDopplerDesc(T)
Fetal Heart Rate Measurements
MaternalPatientID(X)
EpisodeID(X)
FetusID(X)
FetalHeartRate(N)
FHRAccels(N)
FHRAccelsHour(N)
FHRBaseline(N)
FHRCategory(EC)
FHRDuration(N)
FHRHighVarHour(N)
FHRDecels(N)
FHRHighVariationEpisodes(N)
FHRLowVarHour(N)
FHRLowVariationEpisodes(N)
FHROverallVariation(N)
FHRShortTimeVariation(N)
FHRSignalLoss(N)
FHRStart(DT)
Extremities
MaternalPatientID(X)
EpisodeID(X)
FetusID(X)
Feet(N)
Hands(N)
Humerus(N)
Femur(N)
Radius(N)
Ulna(N)
Tibia(N)
Fibula(N)
Joints(N)
ExtremitiesNormal(B)
ExtremitiesDesc(T)
Face
MaternalPatientID(X)
EpisodeID(X)
FetusID(X)
Eyes(N)
Nose(N)
Palate(N)
Profile(N)
FacialCleft(B)
EarsAbnormal(B)
EyesAbnormal(B)
Macroglossia(B)
Micrognathia(B)
NoseAbnormal(B)
FacialTumour(B)
FaceOther(T)
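
The attribute entries listed above share a common shape: an attribute name followed by a short code in parentheses. As an illustration only, the following Python sketch splits such entries into name and code pairs, including cases where two entries have run together on one line. The function name parse_attributes, the regular expression, and the reading of the parenthesised suffix as a type code are assumptions made for this sketch and are not taken from the thesis; the codes themselves are left uninterpreted.

```python
import re
from collections import Counter

# Assumed entry shape: a name (which may contain spaces or '$') followed by
# an upper-case code in parentheses, e.g. "Ethnic Group(EC)".
ATTRIBUTE = re.compile(r"([^\s()][^()\n]*?)\(([A-Z]+)\)")


def parse_attributes(text):
    """Return (name, code) pairs for every entry found in the text."""
    return [(name.strip(), code) for name, code in ATTRIBUTE.findall(text)]


# A few entries copied from the listing, including two fused lines.
sample = """PatientId(X)
Ethnic Group(EC)
FamilyDrawing(BLOB)HomePhone(N)
DateOfBirth(D)
AFDeepestPool$(B) AFIndex(N)
FHRStart(DT)"""

pairs = parse_attributes(sample)

# Tally how often each code occurs in the sample.
code_counts = Counter(code for _, code in pairs)

for name, code in pairs:
    print(f"{name}: {code}")
```

Lines without a parenthesised code, such as the entity names MaternalPatient or Fetus, are skipped by this pattern, so a fuller version would need separate handling to group attributes under their entity.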