Intuitive querying of e-Health
data repositories
Catalina Hallett, Richard Power, Donia Scott
Centre for Research in Computing, The Open University
Overview
- Background
- Querying the CLEF database
- Query editing
- Answer generation
- Evaluation
- Conclusions
Background
- CLEF (Clinical E-Science Framework) is an MRC-funded project aiming to provide a repository of well-organised, data-encoded clinical histories
- The current repository contains about 20,000 records of cancer patients
- The aim of the CLEF query interface is to provide efficient access to aggregated data for:
  - Assisting in diagnosis and treatment
  - Identifying patterns in treatment
  - Selecting subjects for clinical trials
What does the CLEF database provide?
- Evidence from about 20,000 patient records, comprising 3.5 million record components (about 5 GB of data)
- 162 queriable fields
- Various text-only records (non-queriable)
- Two types of data:
  - Structured
  - Extracted from narratives by IE (information extraction)
- Queriable data is encoded according to various medical terminologies (SNOMED, ICD, UMLS)
- Approximately 19,500 different medical codes are currently used in the database (a relatively small subset of SNOMED and ICD)
Queriable data
- Structured data:
  - Demographics: age, gender, postal district, ethnic group, occupation
  - Laboratory findings: 32 types of haematology findings; 51 types of chemistry findings; cytology reports; histopathology reports
  - Imaging studies: radiology procedure, site, diagnosis, morphology, topography, report, indication, department
  - Treatments: prescription drugs, chemotherapy protocol, IV chemotherapy, radiotherapy, surgical procedures
  - Diagnoses: clinical diagnosis, cause(s) of death
- Data extracted from narratives
Query interface requirements
- Designed for:
  - Casual and moderate users, who are familiar with the semantic domain of the repository but not with its technical implementation
  - Typically clinicians or medical researchers
- Should:
  - Allow the construction of complex queries with nested structures and temporal expressions
  - Minimise the risk of ambiguities
  - Offer good coverage of the data types in the CLEF database
- Should be usable with:
  - Minimal training
  - No prior knowledge of medical terminologies, formal query languages, or databases
Typical queries
- “How many patients with AML have had a normal count after two cycles of treatment?”
- “How many patients with primary breast cancer have relapsed in the last five years?”
- “What is the median time between first drug treatment for metastatic breast cancer and death?”
- “In breast cancer patients, what is the incidence of lymphoedema of the arm that persists more than two years after primary surgical treatment?”
- “What is the average number of x-rays for patients with prostate cancer?”
- “What is the average time between first treatment for cervical cancer and death for patients aged less than 60 at death compared with those aged over 60?”
- “How many patients between the ages of 40 and 60 when they were first diagnosed with lung cancer had a platelet count higher than 300 but a white cell count lower than 3 before the 4th cycle of any course of chemotherapy they received during treatment?”
Querying alternatives
- SQL:
  - Not appropriate for the typical CLEF user
  - Requires deep knowledge of the database structure and content, and of the medical terminologies used in the database
- Graphical interfaces:
  - Have to cope with a large number of parameters
  - Nested structures and temporal restrictions are difficult to express
- Natural language interfaces:
  - More natural and more expressive than formal query languages, but…
    - Sensitive to errors in composition, spelling, and vocabulary
    - Normally understand only a subset of natural language
    - Complex queries are difficult to process
    - It is difficult to trace the source of errors in the result
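To make the SQL objection concrete, here is a hedged sketch of what one of the typical queries above might look like in raw SQL. The table names, column names, join keys, and the diagnosis code are all invented for illustration; the slides do not show the real CLEF schema or coding.

```python
# Hypothetical raw-SQL version of "how many patients with primary breast
# cancer have relapsed in the last five years?". Every table/column name
# and the ICD code below are invented for illustration only.
sql = """
SELECT COUNT(DISTINCT d.patient_id)
FROM diagnoses d
JOIN clinical_events r ON r.patient_id = d.patient_id
WHERE d.icd_code = 'C50'                     -- breast cancer (assumed code)
  AND d.kind = 'primary'
  AND r.event_type = 'relapse'
  AND r.event_date >= DATE('now', '-5 years')
"""
# To write this, the user must already know the schema, the join keys, and
# the right terminology code -- exactly the knowledge the slide says typical
# CLEF users lack.
```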
The CLEF approach
- Similar to natural language interfaces, except that the user edits the conceptual meaning of a query instead of its surface text
- Allows users to easily construct unambiguous queries
- Guides users towards constructing only correct queries (queries compatible with the content of the database)
- Semi-database-independent but very domain-specific
- Based on the WYSIWYM technique (Power et al. 1998)
- The query is presented to the user as an interactive text, and is edited by making selections on various components of the query
- Each selection triggers a text re-generation process which results in a new feedback text incorporating the selection the user made
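The edit-and-regenerate cycle can be sketched as follows. This is a toy illustration of the general WYSIWYM idea, not the actual CLEF implementation: unfilled parts of the query appear in the feedback text as bracketed anchors, and every selection regenerates the whole text.

```python
# Toy WYSIWYM loop (assumed design, not the CLEF code): unfilled slots are
# rendered in brackets; filling a slot regenerates the whole feedback text.
def generate_feedback(query):
    cancer = query.get("cancer") or "[some cancer]"
    site = query.get("site") or "[some body part]"
    return f"Count patients diagnosed with {cancer} in {site}."

query = {}
print(generate_feedback(query))
# -> Count patients diagnosed with [some cancer] in [some body part].

query["cancer"] = "squamous cell carcinoma"  # user clicks an anchor, selects
print(generate_feedback(query))
# -> Count patients diagnosed with squamous cell carcinoma in [some body part].
```

The user never types free text, so there is nothing to parse and nothing to misspell; the interface only ever displays text it generated itself.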
Query editing
Modelling queries
- A query has four distinct sections:
  - A description of the subjects (in terms of demographic information and basic diagnosis)
  - A description of the treatments the subjects received
  - A description of laboratory findings
  - An outcome section (what we want from the group of patients we have just described)
- Each query element can be expressed as a conjunction or disjunction of same-type query elements, e.g.:
  - Cancer of the breast and of the lung
  - Patients who received chemotherapy and radiotherapy
- Some query elements can be temporally related to each other, e.g.:
  - Patients who received chemotherapy within 5 months of surgery
  - Patients alive 5 years after the diagnosis
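One way the four-section model with boolean and temporal combinators might be represented internally is sketched below. This is a hypothetical data structure; the slides do not show CLEF's actual internal representation.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    kind: str          # e.g. "diagnosis", "treatment", "finding"
    value: str

@dataclass
class Group:
    op: str            # "and" / "or" over same-type elements
    items: list

@dataclass
class TemporalLink:
    first: Element
    second: Element
    relation: str      # e.g. "within"
    months: int

@dataclass
class Query:
    subjects: Group            # demographics + basic diagnosis
    treatments: Group
    findings: Group
    outcome: str               # e.g. "count", "median time to death"
    temporal: list = field(default_factory=list)

# "Breast cancer patients who received chemotherapy within 5 months of surgery"
chemo = Element("treatment", "chemotherapy")
surgery = Element("treatment", "surgery")
q = Query(
    subjects=Group("and", [Element("diagnosis", "breast cancer")]),
    treatments=Group("and", [chemo, surgery]),
    findings=Group("and", []),
    outcome="count",
    temporal=[TemporalLink(chemo, surgery, "within", 5)],
)
```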
Constraining user choices
- At each step, users are only given correct choices
- Choices are context-dependent:
  - Patients diagnosed with [some cancer] in [some body part]
  - The user selects [some cancer] => “squamous cell carcinoma”
  - The interface restricts the choices available for [some body part] to those sites where squamous cell carcinoma can develop
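The context-dependent restriction could be implemented with a simple compatibility table, as in the sketch below. The morphology-to-site pairs are invented toy data, not real SNOMED relations, and the real interface would derive them from the terminology.

```python
# Toy compatibility table (invented pairs, for illustration only).
COMPATIBLE_SITES = {
    "squamous cell carcinoma": {"lung", "oesophagus", "cervix", "skin"},
    "adenocarcinoma": {"breast", "colon", "prostate", "lung"},
}
ALL_SITES = set().union(*COMPATIBLE_SITES.values())

def site_choices(selected_cancer=None):
    """Return the [some body part] menu, narrowed by the cancer selection."""
    if selected_cancer is None:
        return sorted(ALL_SITES)            # nothing selected yet: all sites
    return sorted(COMPATIBLE_SITES[selected_cancer])

print(site_choices("squamous cell carcinoma"))
# -> ['cervix', 'lung', 'oesophagus', 'skin']
```

Because impossible combinations are never offered, the resulting query is guaranteed to be compatible with the content of the database.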
Dealing with ambiguities
- Once a query is constructed, there is only one way it can be interpreted – there is no disambiguation task to be performed
- … but users may be misled into constructing a different query than they intend
- Once a query is constructed and submitted, the user is presented with a re-formulation of the query as a double-checking mechanism. They then have the option to revise the query or proceed with the retrieval of the results
Answer generation
- The answer set consists of an age/gender breakdown of the patients that fulfil the query requirements
- Each additional clinical feature is combined with the age/gender breakdown to provide more detailed information
- Three types of rendering:
  - Text
  - Charts
  - Tables
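The age/gender breakdown amounts to a cross-tabulation of the matching patients, which could be computed as sketched below. The patient records are invented sample data, and the ten-year banding is an assumption.

```python
from collections import Counter

# Invented sample of patients matching some query.
patients = [
    {"age": 45, "gender": "F"}, {"age": 52, "gender": "F"},
    {"age": 48, "gender": "M"}, {"age": 63, "gender": "M"},
]

def age_band(age, width=10):
    """Bucket an age into a band such as '40-49' (band width assumed)."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

# Cross-tabulate (age band, gender): the common core behind the text,
# chart, and table renderings.
breakdown = Counter((age_band(p["age"]), p["gender"]) for p in patients)
for (band, gender), n in sorted(breakdown.items()):
    print(f"{band:>6}  {gender}  {n}")
```

Adding a further clinical feature would simply extend the grouping key with that feature's value.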
Evaluation
- Research questions:
  - Can the WYSIWYM query formulation method be easily learned by users of CLEF?
  - Is it easier to formulate CLEF queries in SQL or with the WYSIWYM query formulation method?
Evaluation procedure
- Subjects:
  - We tested the performance of 11 subjects.
  - Subjects had a range of expertise in the CLEF domain, from expert (oncologist) to novice (computer scientist), but most had some medical training.
  - Subjects had no previous experience with the CLEF WYSIWYM query interface, but most were aware of its fundamental principles.
- Methodology:
  - Subjects were given a set of four fixed queries to formulate using the CLEF WYSIWYM query interface.
  - The queries were expressed in language as different as possible from the language used in the query interface.
  - Each subject received the queries in a different order.
Evaluation – data analysis
- We recorded:
  - The time taken to compose each query
  - The number of operations used to construct each query, compared with the (pre-computed) optimal number of operations
- We analysed whether performance, as indicated by speed and efficiency, improves with training (experience).
Evaluation results
Time to completion
- Subjects’ performance improved dramatically with experience.
- After their first experience of composing a query, subjects’ completion time halved, and stayed at that level.
[Chart: time to completion in minutes (0–7) against order of query (1–4)]
Evaluation results
Performance over time: performance normalised over complexity
- After just one go with the CLEF interface, subjects are highly proficient in their ability to compose complex queries.
- By the time they get to their fourth query, subjects’ performance is almost perfect.
[Chart: excess operations, (total − optimal)/optimal, against order of query (1–4); mean: 0.18]
Optimal operations = the minimum number of operations needed to compose the query perfectly. This is a measure of the complexity of the query.
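Written out as code, the efficiency measure behind the chart is the excess-operation ratio, where 0.0 means the subject composed the query with no wasted operations:

```python
def excess_ratio(total_ops, optimal_ops):
    """(total - optimal) / optimal: 0.0 means a perfect composition."""
    return (total_ops - optimal_ops) / optimal_ops

# e.g. a query with a 10-operation optimum composed in 12 operations:
print(excess_ratio(12, 10))   # -> 0.2
```

On this measure the reported mean of 0.18 means subjects used, on average, 18% more operations than the pre-computed optimum.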
Evaluation – comparison with SQL
- Very small-scale experiment
- Two subjects:
  - With expert knowledge of the structure, organisation and content of the CLEF database
  - Highly skilled users of SQL
  - With minimal experience with WYSIWYM
  - Given access to the SNOMED and ICD codes required to build the SQL
- Each subject composed a query first in the CLEF WYSIWYM interface and then in SQL
Evaluation – comparison with SQL
- Subject 1 – Query 1: WYSIWYM 2.3 mins; SQL 8.5 mins (incomplete)
- Subject 2 – Query 2: WYSIWYM 4.5 mins; SQL 12 mins (incomplete)
[Chart: composition time in minutes, WYSIWYM vs. SQL, for Subject 1 and Subject 2]
- Even with a slowly reacting interface, the subjects were much faster composing queries in WYSIWYM than in SQL
Conclusions
- The CLEF WYSIWYM query interface works!
- The method is easily acquired.
- Our investigation shows that it is much easier to use than the current alternative (viz. SQL).
- It is a viable solution to querying the CLEF repository.
- However…
Shortcomings and unresolved issues
- The current implementation of the interface is too slow (subjects complained!)
- We haven’t yet tested the feedback text:
  - Could the design be better?
  - Is it as unambiguous as we think?
- Are the queries we currently support the ones real users will want to ask?
- Does the query interface provide sufficient data coverage?