Taming EHR Data: Using Semantic Similarity to Reduce Dimensionality
Jim Weatherall, PhD
Head, Advanced Analytics Centre, AstraZeneca
Visiting Lecturer, School of Computer Science, University of Manchester
14th World Congress on Medical & Health Informatics, August 2013, Copenhagen
On behalf of the authors:
Leila Kalankesh, School of Computer Science, UoM
James Weatherall, AstraZeneca
Thamer Ba-Dhfari, School of Computer Science, UoM
Iain Buchan, Institute of Population Health, UoM
Andy Brass, School of Computer Science, UoM
Introduction
Problems with mining healthcare data
• Large collections not easily visualised or interpreted
• Research not the primary purpose for collection
J.Weatherall | August 2013
Read Code    Rubric
C10F.        Type II Diabetes Mellitus
1372.        Trivial smoker < 1 cig/day
bd3j.        Prescription of “Atenolol 25mg tablets”
G20.         Essential hypertension
2469.        Measurement of Diastolic Blood Pressure
246A.        Assessment of Diastolic Blood Pressure

100s of 1000s of codes, 10s of 1000s of dimensions
Biometrics & Information Sciences | GMD
Data
The Salford Integrated Record (SIR)
• Population ~220,000
• Integrated primary and secondary care information
• Individual Read Code entries captured in primary care information systems
  • Codes for diagnosis
  • Codes for procedures
• All clinical transactions in primary care and some in secondary care
• Data extract for this analysis based on:
  • GP data in date range 2003-2009, containing 136M Read code entries
  • Selected 24K patients with chronic conditions, containing 443K Read code entries
Methods
Semantic Similarity
How alike are the meanings of two terms?
Measure ontological distance?
Measure depth, or not?
[Figure from Sanchez, J. Biomed. Inform., 2011]
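The two questions on this slide can be made concrete with a small sketch. Assuming a hypothetical toy taxonomy stored as a child-to-parent map (the term names and hierarchy below are illustrative, not taken from the Read code release or from the figure), one option is to count edges between two terms, and another is to also account for how deep their lowest common ancestor sits. The depth-aware score here follows the Wu-Palmer style formula, with depth counted in edges from the root; that convention is an assumption of this sketch.

```python
# Hypothetical toy taxonomy as a child -> parent map (illustration only).
PARENT = {
    "disease": None,
    "endocrine": "disease",
    "circulatory": "disease",
    "diabetes": "endocrine",
    "type1_diabetes": "diabetes",
    "type2_diabetes": "diabetes",
    "hypertension": "circulatory",
}


def path_to_root(term):
    """Return [term, parent, ..., root]."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def depth(term):
    """Number of edges between the term and the root."""
    return len(path_to_root(term)) - 1


def lowest_common_ancestor(a, b):
    """Deepest term that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for term in path_to_root(b):  # walks upward, so the first hit is the deepest
        if term in ancestors_of_a:
            return term
    raise ValueError("terms do not share a root")


def path_length(a, b):
    """Ontological distance: number of edges on the path from a to b."""
    lca = lowest_common_ancestor(a, b)
    return (depth(a) - depth(lca)) + (depth(b) - depth(lca))


def depth_aware_similarity(a, b):
    """Wu-Palmer style score: 2 * depth(lca) / (depth(a) + depth(b))."""
    total = depth(a) + depth(b)
    if total == 0:  # both terms are the root
        return 1.0
    return 2 * depth(lowest_common_ancestor(a, b)) / total


print(path_length("type1_diabetes", "type2_diabetes"))
print(path_length("endocrine", "circulatory"))
print(depth_aware_similarity("type1_diabetes", "type2_diabetes"))
print(depth_aware_similarity("endocrine", "circulatory"))
```

In this toy example both pairs of siblings are two edges apart, but only the depth-aware score separates the specific pair (the two diabetes types) from the general pair (two top-level chapters).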
Methods
Semantic Similarity – which method?
An ontology of methods!
[Figure: tree of methods rooted at “Semantic Similarity Method”; node labels: Ontological, Corpus-based, Node-based, Edge-based, Hybrid, Frequency, Context, Proximity, Combined]
Semantic similarity calculation
The Resnik measure
c  codes( c )count (c)
P (c ) 
N
1
2
3
Term probability, based on
frequency, including descendants
and annotations
IC (c)   log P(c)
Log transformation, gives
“Information Content”
sim Re s (c1, c 2)  IC (CMICA)
IC of “Most Informative
Common Ancestor” gives
similarity measure
P. Resnik, “Semantic similarity in a taxonomy: An information-based measure and its application to problems of
ambiguity in natural language”, J Artif Intell Res, 1999
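The three steps of the Resnik measure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the hierarchy, term names, and counts are hypothetical, codes(c) is taken to mean a term together with all of its descendants, and N is taken to be the total count over the whole hierarchy so that the root has probability 1.

```python
import math

# Hypothetical toy hierarchy (child -> parent) and per-term usage counts.
PARENT = {
    "root": None,
    "diabetes": "root",
    "type1_diabetes": "diabetes",
    "type2_diabetes": "diabetes",
    "hypertension": "root",
}
COUNT = {
    "root": 0,
    "diabetes": 10,
    "type1_diabetes": 20,
    "type2_diabetes": 50,
    "hypertension": 20,
}


def ancestors(code):
    """The code itself plus every term above it, up to the root."""
    out = []
    while code is not None:
        out.append(code)
        code = PARENT[code]
    return out


def subtree_count(code):
    """count(c') summed over the code and all of its descendants."""
    return sum(n for c, n in COUNT.items() if code in ancestors(c))


# Assumption: N is the total number of recorded entries in the hierarchy.
N = subtree_count("root")


def probability(code):
    """Step 1: term probability P(c)."""
    return subtree_count(code) / N


def information_content(code):
    """Step 2: IC(c) = -log P(c)."""
    return -math.log(probability(code))


def resnik(c1, c2):
    """Step 3: IC of the most informative common ancestor of c1 and c2."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(c) for c in common)


print(resnik("type1_diabetes", "type2_diabetes"))  # IC of "diabetes"
print(resnik("type2_diabetes", "hypertension"))    # only the root is shared
```

With these toy counts, the two diabetes terms share the ancestor "diabetes" (probability 0.8), while a diabetes term and "hypertension" share only the root, whose information content is zero.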
Analysis Plan
Stepwise approach to dimensionality reduction
1. Map patient records from diagnosis space into a similarity space
2. Map patient records into a low-dimensional vector space via PCA
3. Project patient records onto low-dimensional vector space and cluster patients by similarity
Analysis – Step 1
Mapping from diagnosis space to similarity space
        p1            p2            …    pn
p1      sim(p1,p1)    sim(p1,p2)    …    sim(p1,pn)
p2      sim(p2,p1)    sim(p2,p2)    …    sim(p2,pn)
…       …             …             …    …
pn      sim(pn,p1)    sim(pn,p2)    …    sim(pn,pn)
“The Similarity Matrix”
pi = patient i
sim(pi,pj) = similarity score between patients i and j
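The slides do not say how code-level similarities are combined into a patient-level score, so the sketch below assumes one common choice, a best-match average: each code of one patient is paired with its most similar code in the other patient, and the best scores are averaged over both directions. The code labels and the `CODE_SIM` lookup table are hypothetical stand-ins for the output of a code-level measure such as Resnik's.

```python
# Hypothetical code-level similarity scores (stand-in for a Resnik-type measure).
CODE_SIM = {
    ("A", "A"): 1.0, ("B", "B"): 1.0, ("C", "C"): 1.0,
    ("A", "B"): 0.8, ("A", "C"): 0.1, ("B", "C"): 0.2,
}


def code_similarity(c1, c2):
    """Symmetric lookup; unknown pairs default to 0.0."""
    return CODE_SIM.get((c1, c2), CODE_SIM.get((c2, c1), 0.0))


def patient_similarity(codes_i, codes_j):
    """Best-match average (an assumed aggregation, not stated in the slides).

    Both code sets are assumed to be non-empty.
    """
    best_i = [max(code_similarity(a, b) for b in codes_j) for a in codes_i]
    best_j = [max(code_similarity(a, b) for a in codes_i) for b in codes_j]
    return (sum(best_i) + sum(best_j)) / (len(best_i) + len(best_j))


def similarity_matrix(patients):
    """Build the symmetric n x n matrix of sim(pi, pj)."""
    n = len(patients)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # fill the upper triangle and mirror it
            score = patient_similarity(patients[i], patients[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Three hypothetical patients, each described by a set of codes.
patients = [{"A"}, {"A", "B"}, {"C"}]
S = similarity_matrix(patients)
for row in S:
    print([round(x, 3) for x in row])
```

Each row of `S` describes one patient by similarity to every other patient, which is the representation the later PCA step operates on.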
Analysis – Steps 2 + 3
PCA on the similarity matrix, visualisation & clustering
Natural co-morbidity: diabetes is a risk factor for angina due to its accelerating effect on atherosclerosis.
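Steps 2 and 3 can be sketched with the standard library alone. The assumptions here are loud ones: each row of the similarity matrix is treated as a patient's feature vector, the principal components are found by power iteration with deflation, and the clustering step uses a small k-means, since the slides do not name a clustering algorithm. The 5 x 5 matrix is made-up data with two obvious patient groups.

```python
import math


def column_centre(rows):
    """Subtract each column mean so PCA works on centred data."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]


def covariance(X):
    """Sample covariance matrix of the centred rows."""
    n, m = len(X), len(X[0])
    return [[sum(X[k][i] * X[k][j] for k in range(n)) / (n - 1)
             for j in range(m)] for i in range(m)]


def top_eigenvector(C, iterations=500):
    """Power iteration: dominant eigenvalue and eigenvector of symmetric C."""
    m = len(C)
    v = [1.0 / (i + 1) for i in range(m)]  # fixed, non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(C[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(C[i][j] * v[j] for j in range(m)) for i in range(m))
    return eigenvalue, v


def pca_scores(S, n_components=2):
    """Project each row of S onto its leading principal components."""
    X = column_centre(S)
    C = covariance(X)
    components = []
    for _ in range(n_components):
        lam, v = top_eigenvector(C)
        components.append(v)
        # Deflation: remove the component just found before the next pass.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(len(v))]
             for i in range(len(v))]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in X]


def kmeans(points, k=2, iterations=20):
    """Tiny k-means with farthest-first initialisation (deterministic)."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up similarity matrix: patients 0-2 resemble each other, as do 3-4.
S = [
    [1.00, 0.90, 0.80, 0.10, 0.20],
    [0.90, 1.00, 0.85, 0.15, 0.10],
    [0.80, 0.85, 1.00, 0.20, 0.10],
    [0.10, 0.15, 0.20, 1.00, 0.90],
    [0.20, 0.10, 0.10, 0.90, 1.00],
]
scores = pca_scores(S, n_components=2)
labels = kmeans(scores, k=2)
print(labels)
```

On real data the two-dimensional `scores` would be plotted for visual inspection, and the number of clusters would need to be chosen rather than fixed at two.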
Discussion & Conclusion
Review & Outlook
• Patients with similar diagnosis codes are grouped together
• Therefore, the semantic similarity technique works, to some
degree
• Therefore, this is a viable route to dimensionality reduction
in complex healthcare data sets
• Transferability of method?
• Population level characterisation?
• New biomedical hypotheses?
• Exploring comorbidity and cotreatment effects?
• New data mining paradigms?