Intuitive querying of e-Health
data repositories
Catalina Hallett, Richard Power, Donia Scott
Centre for Research in Computing, The Open University
Overview
- Background
- Querying the CLEF database
- Query editing
- Answer generation
- Evaluation
- Conclusions
Background
- CLEF (Clinical E-Science Framework) is an MRC-funded project aiming to provide a repository of well-organised, data-encoded clinical histories
- The current repository contains about 20,000 records of cancer patients
- The aim of the CLEF query interface is to provide efficient access to aggregated data for:
  - Assisting in diagnosis and treatment
  - Identifying patterns in treatment
  - Selecting subjects for clinical trials
What does the CLEF database provide?
- Evidence from about 20,000 patient records, comprising 3.5 million record components (about 5 GB of data)
- 162 queriable fields
- Various text-only records (non-queriable)
- Two types of data:
  - Structured
  - Extracted from narratives by IE (information extraction)
- Queriable data is encoded according to various medical terminologies (SNOMED, ICD, UMLS)
- Approximately 19,500 different medical codes are currently used in the database (a relatively small subset of SNOMED and ICD)
Queriable data
- Structured data:
  - Demographics: age, gender, postal district, ethnic group, occupation
  - Laboratory findings: 32 types of haematology findings; 51 types of chemistry findings; cytology reports; histopathology reports
  - Imaging studies: radiology procedure, site, diagnosis, morphology, topography, report, indication, department
  - Treatments: prescription drugs, chemotherapy protocol, IV chemotherapy, radiotherapy, surgical procedures
  - Diagnoses: clinical diagnosis, cause(s) of death
- Data extracted from narratives
Query interface requirements
- Designed for:
  - Casual and moderate users, who are familiar with the semantic domain of the repository but not with its technical implementation
  - Typically clinicians or medical researchers
- Should:
  - Allow the construction of complex queries with nested structures and temporal expressions
  - Minimise the risk of ambiguities
  - Offer good coverage of the data types in the CLEF database
- Should be usable with:
  - Minimal training
  - No prior knowledge of medical terminologies, formal query languages, or databases
Typical queries
- “How many patients with AML have had a normal count after two cycles of treatment?”
- “How many patients with primary breast cancer have relapsed in the last five years?”
- “What is the median time between first drug treatment for metastatic breast cancer and death?”
- “In breast cancer patients, what is the incidence of lymphoedema of the arm that persists more than two years after primary surgical treatment?”
- “What is the average number of x-rays for patients with prostate cancer?”
- “What is the average time between first treatment for cervical cancer and death for patients aged less than 60 at death compared with those aged over 60?”
- “How many patients between the ages of 40 and 60 when they were first diagnosed with lung cancer had a platelet count higher than 300 but a white cell count lower than 3 before the 4th cycle of any course of chemotherapy they received during treatment?”
Querying alternatives
- SQL:
  - Not appropriate for the typical CLEF user
  - Requires deep knowledge of the database structure and content, and of the medical terminologies used in the database
- Graphical interfaces:
  - Have to cope with a large number of parameters
  - Nested structures and temporal restrictions are difficult to express
- Natural language interfaces:
  - More natural and more expressive than formal query languages, but…
    - Sensitive to errors in composition, spelling, and vocabulary
    - Normally understand only a subset of natural language
    - Complex queries are difficult to process
    - It is difficult to trace the source of errors in the result
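To make the SQL objection concrete, here is a hedged sketch of what one of the typical queries above might look like in raw SQL. The table names, column names, join keys, and the diagnosis code are all invented for illustration; the slides do not show the real CLEF schema or coding.

```python
# Hypothetical raw-SQL version of "how many patients with primary breast
# cancer have relapsed in the last five years?". Every table/column name
# and the ICD code below are invented for illustration only.
sql = """
SELECT COUNT(DISTINCT d.patient_id)
FROM diagnoses d
JOIN clinical_events r ON r.patient_id = d.patient_id
WHERE d.icd_code = 'C50'                     -- breast cancer (assumed code)
  AND d.kind = 'primary'
  AND r.event_type = 'relapse'
  AND r.event_date >= DATE('now', '-5 years')
"""
# To write this, the user must already know the schema, the join keys, and
# the right terminology code -- exactly the knowledge the slide says typical
# CLEF users lack.
```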
The CLEF approach
- Similar to natural language interfaces, except that the user edits the conceptual meaning of a query instead of its surface text
- Allows users to easily construct unambiguous queries
- Guides users towards constructing only correct queries (queries compatible with the content of the database)
- Semi-database-independent but very domain-specific
- Based on the WYSIWYM technique (Power et al. 1998)
- The query is presented to the user as an interactive text, and is edited by making selections on various components of the query
- Each selection triggers a text re-generation process which results in a new feedback text incorporating the selection the user made
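The edit-and-regenerate cycle can be sketched as follows. This is a toy illustration of the general WYSIWYM idea, not the actual CLEF implementation: unfilled parts of the query appear in the feedback text as bracketed anchors, and every selection regenerates the whole text.

```python
# Toy WYSIWYM loop (assumed design, not the CLEF code): unfilled slots are
# rendered in brackets; filling a slot regenerates the whole feedback text.
def generate_feedback(query):
    cancer = query.get("cancer") or "[some cancer]"
    site = query.get("site") or "[some body part]"
    return f"Count patients diagnosed with {cancer} in {site}."

query = {}
print(generate_feedback(query))
# -> Count patients diagnosed with [some cancer] in [some body part].

query["cancer"] = "squamous cell carcinoma"  # user clicks an anchor, selects
print(generate_feedback(query))
# -> Count patients diagnosed with squamous cell carcinoma in [some body part].
```

The user never types free text, so there is nothing to parse and nothing to misspell; the interface only ever displays text it generated itself.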
Query editing
Modelling queries
- A query has four distinct sections:
  - A description of the subjects (in terms of demographic information and basic diagnosis)
  - A description of the treatments the subjects received
  - A description of laboratory findings
  - An outcome section (what we want from the group of patients we have just described)
- Each query element can be expressed as a conjunction or disjunction of same-type query elements, e.g.:
  - Cancer of the breast and of the lung
  - Patients who received chemotherapy and radiotherapy
- Some query elements can be temporally related to each other, e.g.:
  - Patients who received chemotherapy within 5 months of surgery
  - Patients alive 5 years after the diagnosis
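One way the four-section model with boolean and temporal combinators might be represented internally is sketched below. This is a hypothetical data structure; the slides do not show CLEF's actual internal representation.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    kind: str          # e.g. "diagnosis", "treatment", "finding"
    value: str

@dataclass
class Group:
    op: str            # "and" / "or" over same-type elements
    items: list

@dataclass
class TemporalLink:
    first: Element
    second: Element
    relation: str      # e.g. "within"
    months: int

@dataclass
class Query:
    subjects: Group            # demographics + basic diagnosis
    treatments: Group
    findings: Group
    outcome: str               # e.g. "count", "median time to death"
    temporal: list = field(default_factory=list)

# "Breast cancer patients who received chemotherapy within 5 months of surgery"
chemo = Element("treatment", "chemotherapy")
surgery = Element("treatment", "surgery")
q = Query(
    subjects=Group("and", [Element("diagnosis", "breast cancer")]),
    treatments=Group("and", [chemo, surgery]),
    findings=Group("and", []),
    outcome="count",
    temporal=[TemporalLink(chemo, surgery, "within", 5)],
)
```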
Constraining user choices
- At each step, users are only given correct choices
- Choices are context-dependent:
  - Patients diagnosed with [some cancer] in [some body part]
  - The user selects [some cancer] => “squamous cell carcinoma”
  - The interface restricts the choices available for [some body part] to those sites where squamous cell carcinoma can develop
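The context-dependent restriction could be implemented with a simple compatibility table, as in the sketch below. The morphology-to-site pairs are invented toy data, not real SNOMED relations, and the real interface would derive them from the terminology.

```python
# Toy compatibility table (invented pairs, for illustration only).
COMPATIBLE_SITES = {
    "squamous cell carcinoma": {"lung", "oesophagus", "cervix", "skin"},
    "adenocarcinoma": {"breast", "colon", "prostate", "lung"},
}
ALL_SITES = set().union(*COMPATIBLE_SITES.values())

def site_choices(selected_cancer=None):
    """Return the [some body part] menu, narrowed by the cancer selection."""
    if selected_cancer is None:
        return sorted(ALL_SITES)            # nothing selected yet: all sites
    return sorted(COMPATIBLE_SITES[selected_cancer])

print(site_choices("squamous cell carcinoma"))
# -> ['cervix', 'lung', 'oesophagus', 'skin']
```

Because impossible combinations are never offered, the resulting query is guaranteed to be compatible with the content of the database.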
Dealing with ambiguities
- Once a query is constructed, there is only one way it can be interpreted – there is no disambiguation task to be performed
- … but users may be misled into constructing a different query than they intend
- Once a query is constructed and submitted, the user is presented with a re-formulation of the query as a double-checking mechanism. They then have the option to revise the query or proceed with the retrieval of the results
Answer generation
- The answer set consists of an age/gender breakdown of the patients that fulfil the query requirements
- Each additional clinical feature is combined with the age/gender breakdown to provide more detailed information
- Three types of rendering:
  - Text
  - Charts
  - Tables
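The age/gender breakdown amounts to a cross-tabulation of the matching patients, which could be computed as sketched below. The patient records are invented sample data, and the ten-year banding is an assumption.

```python
from collections import Counter

# Invented sample of patients matching some query.
patients = [
    {"age": 45, "gender": "F"}, {"age": 52, "gender": "F"},
    {"age": 48, "gender": "M"}, {"age": 63, "gender": "M"},
]

def age_band(age, width=10):
    """Bucket an age into a band such as '40-49' (band width assumed)."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

# Cross-tabulate (age band, gender): the common core behind the text,
# chart, and table renderings.
breakdown = Counter((age_band(p["age"]), p["gender"]) for p in patients)
for (band, gender), n in sorted(breakdown.items()):
    print(f"{band:>6}  {gender}  {n}")
```

Adding a further clinical feature would simply extend the grouping key with that feature's value.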
Evaluation
- Research questions:
  - Can the WYSIWYM query formulation method be easily learned by users of CLEF?
  - Is it easier to formulate CLEF queries in SQL or with the WYSIWYM query formulation method?
Evaluation procedure
- Subjects:
  - We tested the performance of 11 subjects.
  - Subjects had a range of expertise in the CLEF domain, from expert (oncologist) to novice (computer scientist), but most had some medical training.
  - Subjects had no previous experience with the CLEF WYSIWYM query interface, but most were aware of its fundamental principles.
- Methodology:
  - Subjects were given a set of four fixed queries to formulate using the CLEF WYSIWYM query interface.
  - The queries were expressed in language as different as possible from the language used in the query interface.
  - Each subject received the queries in a different order.
Evaluation – data analysis
- We recorded:
  - The time taken to compose each query
  - The number of operations used to construct each query, compared with the (pre-computed) optimal number of operations
- We analysed whether performance, as indicated by speed and efficiency, improves with training (experience).
Evaluation results
Time to completion
- Subjects’ performance improved dramatically with experience.
- After their first experience of composing a query, subjects’ completion time halved, and stayed at that level.
[Chart: time to completion in minutes (0–7) against order of query (1–4)]
Evaluation results
Performance over time: performance normalised over complexity
- After just one go with the CLEF interface, subjects are highly proficient in their ability to compose complex queries.
- By the time they get to their fourth query, subjects’ performance is almost perfect.
[Chart: excess operations, (total − optimal)/optimal, against order of query (1–4); mean: 0.18]
Optimal operations = the minimum number of operations needed to compose the query perfectly. This is a measure of the complexity of the query.
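Written out as code, the efficiency measure behind the chart is the excess-operation ratio, where 0.0 means the subject composed the query with no wasted operations:

```python
def excess_ratio(total_ops, optimal_ops):
    """(total - optimal) / optimal: 0.0 means a perfect composition."""
    return (total_ops - optimal_ops) / optimal_ops

# e.g. a query with a 10-operation optimum composed in 12 operations:
print(excess_ratio(12, 10))   # -> 0.2
```

On this measure the reported mean of 0.18 means subjects used, on average, 18% more operations than the pre-computed optimum.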
Evaluation – comparison with SQL
- Very small-scale experiment
- Two subjects:
  - With expert knowledge of the structure, organisation and content of the CLEF database
  - Highly skilled users of SQL
  - With minimal experience with WYSIWYM
  - Given access to the SNOMED and ICD codes required to build the SQL
- Each subject composed a query first in the CLEF WYSIWYM interface and then in SQL
Evaluation – comparison with SQL
- Subject 1 – Query 1: WYSIWYM 2.3 mins; SQL 8.5 mins (incomplete)
- Subject 2 – Query 2: WYSIWYM 4.5 mins; SQL 12 mins (incomplete)
[Chart: composition time in minutes, WYSIWYM vs. SQL, for Subject 1 and Subject 2]
- Even with a slowly reacting interface, the subjects were much faster composing queries in WYSIWYM than in SQL
Conclusions
- The CLEF WYSIWYM query interface works!
- The method is easily acquired.
- Our investigation shows that it is much easier to use than the current alternative (viz. SQL).
- It is a viable solution to querying the CLEF repository.
- However…
Shortcomings and unresolved issues
- The current implementation of the interface is too slow (subjects complained!)
- We haven’t yet tested the feedback text:
  - Could the design be better?
  - Is it as unambiguous as we think?
- Are the queries we currently support the ones real users will want to ask?
- Does the query interface provide sufficient data coverage?